Welcome everyone, thank you all for coming. This webinar is entitled Analyzing Educational Data with Open Science Best Practices, R, and OSF, and we're gonna talk about a bunch of things. So the plan for today: we have a little bit of an introduction and some goals of the webinar. Then I will go through a section on some strategies for using R to analyze educational data. Then Josh is gonna talk about some different things around accessing data on OSF through R, using the osfr package specifically. And then hopefully the plan is to have about 15 minutes or so at the end for Q&A from you all. So hopefully that works for everybody.

Okay, so this is a little bit about us. I'm Cynthia D'Angelo, I'm an assistant professor at the University of Illinois at Urbana-Champaign. I'm in the educational psychology department and the curriculum and instruction department. My background is in physics and science education research. I do a lot around learning technologies and learning analytics. My Twitter is there, and there's my email as well if you'd like to get ahold of me. Josh, you wanna introduce yourself?

Sure, thank you, Cynthia. Hi everybody, my name is Joshua Rosenberg. I'm an assistant professor at the University of Tennessee, Knoxville. I work in science teacher education primarily, but I also have an interest in data science education and in applying data science methods in my research as well. I'm at @jrosenberg6432 on Twitter. Pleased to have the chance to have you all join us today. And I have a dog sleeping in another room; hopefully she'll stay asleep, but just a heads up, she might actually decide to bark at something. So we'll just keep going.

All right, so some of the goals for today. One is to convince you to use R. I think Josh and I are both kind of evangelists for R in some sense. We've been using R for a long time in a lot of different ways, and we think it's so great that we want you to use it too. So hopefully the underlying theme for today is that you should be using R as well, and here are some things that we've learned over time that are gonna make it easier for you to do so. We'll talk about strategies, really thinking about the different complexities that come with educational data and how you can use different kinds of strategies with R and other related tools to make that easier for yourself. There are some OSF-specific tools that we'll talk about as part of your workflow as well. And then hopefully we'll have time to answer some questions.

Okay, so in terms of logistics, a few things. I'm gonna put this link in the chat as we go through... oh, it's going to the panelists. Hold on. There you go, that's for everybody. So we'll try as best we can to put the links that we mention in the chat as well. That link that I just put in goes to the OSF repository, which has copies of these slides as well as some other things, so you can find those there and that hopefully will help people as well. There is a Q&A box, so if you have questions for the Q&A part at the end, please put your questions there. If you have a clarification question, you can try to ping us in the chat. I don't know how well that's gonna work, but we can try it, and we'll try to answer clarification questions as we go. But at the very least, there'll be time at the end for answers to questions. Okay, all right.
So, a little bit of background: hopefully everyone's familiar at some level with what open science is. The Center for Open Science is hosting this, so I think everyone's at least familiar with the term. But the basic idea is that these are different kinds of methods for making the process, the content, and the outcomes of research openly accessible to others. There's a bunch of shared values: openness, integrity, and reproducibility. And I think one of the things we wanna talk about is that this requires a cultural change from people in order for it to really become, you know, a standard way of operating going forward. What I think Josh and I both want everyone to take away from this is that there are small changes you can make within your own practice that can help lead to and reinforce this change. So it really is gonna start with you doing some things differently, talking to other people about how you've done those things differently, showing the openness and showing how these different things are going forward. It doesn't mean you have to change your entire practice or completely switch everything so that everything's open. Really just try to focus on small parts of your workflow, of your process, that you can make more open, and go from there and build on that over time, and help others see it and have it spread that way. And also, as I think Josh will talk more about, this can and probably should be responsive to your actual problems and concerns. So part of what we want you to do is also reflect on which parts of your process are not working well for you right now, and how different kinds of open practices might be able to help solve some of those issues.

And then I think one thing I wanted to talk about first is that there are some commitments of ours that I think are important, that underlie all of our work, but maybe in subtle ways that aren't always clear, and I wanna be really explicit about these things. I think this is really important. There are a lot of different kinds of technical tools and solutions to some of these open science problems, but there are also a lot of philosophical, ethical, and moral issues that you should consider when you're trying to decide what to do with your data and your research, right? We're talking about educational data, and humans are the participants that helped produce your data. I think that's a huge difference from other kinds of science out there. Other fields of science might have different ways of doing open science that work better for them in their specific discipline; because we're dealing with humans, we have a slightly different set of considerations to make, I think. And so everyone should consider that as they're going forward and making decisions. All humans deserve respect, and so do their data. It's important to center that as you go forward and make decisions about what to do with your data at different stages. I think it's also important to say that there's no easy answer for some of these situations that you might face, and that's okay. Part of what open science asks, I think, is for you to consider your options and then document your decision making about why you've chosen to do certain things with your data or your analysis or whatever. And so that transparency doesn't necessarily have to be, here's all my data, here's all the things that are private, whatever.
Really it's just being transparent about why you did what you did, why you made these decisions, and why you're choosing one thing or another. A lot of it can just be documenting your decision making around why you decided not to share something; that can be very helpful to others as well. And so part of this is reflecting early on in your process about what your goals are and how you want to achieve them. These might be your research goals, these might be other things like that. What are your values? How do those goals and values match up? Really sitting with that and reflecting on it can help you decide what you wanna do with different things. And I think the important thing is for you to think about what the best way is for you to work towards open science. It's gonna look different for different people, for different projects, and for different data, and that's important to recognize. Okay, all right.

And then, before I get into my tips and tricks: why R? So, you know, I think the main thing about this open science paradigm is that it requires transparency with your data and your analysis, and R is open source, free, and accessible. There's a huge community of people, you know, documenting things and helping others. And so if you think about the future, when you might need to put your code up somewhere or whatever it is, R is something that everyone can access and everyone can use and see. It's not something behind a paywall or something that not everyone is familiar with. All of the code is available and everyone can see what happened. So that's a really important part of it. Also, it's continually improving, and there are people trying to help each other all the time to get better at it. That's an important part of the whole thing for me. Okay, all right.

So I'm gonna go through a list of different kinds of strategies. I'm gonna try to do this as quickly as possible, because there are a lot of them and I wanna make sure we have time for questions and things like that. Okay, so first, why is educational data special? Josh and I did an unconference session like a month ago, I don't know, time doesn't mean anything anymore, it was a while ago. And we were talking about complex educational data. And I started thinking about why educational data is complex, how it's different than other data, and why some of these things we're talking about are maybe more important to do and to think about if you have educational data, right? So: educational settings are complex, and therefore the data coming out of those settings are complex, right? That's just kind of how it is. You might have multiple data sets. You might have different kinds of data. You might have multiple measurements within one data set. You might have huge data sets. You might have unstructured data. There's usually a lot of complexity and nuance across sites and/or participants. And then again, you have human subjects and FERPA requirements and all these different things; they add up, and you might have multiple things like this at once. So there's a lot to consider. It's not as straightforward as some other types of data, perhaps. So some common issues that might come up are missing data and incomplete data, missing context, incompatible scales, and multiple levels, like individual- and group-level data.
I do a lot of collaboration-focused work, so there are multiple stakeholders with different needs and perspectives, right? Maybe your funding agency has one kind of goal for your work, and the district leaders have another one, and the teachers you worked with have another one, and you have a different one, right? There are sometimes competing needs about what's going on and how to communicate those things. So it's important to keep all of those in mind.

Okay, so here we go, tips and guidelines. First, remember that things that improve your process and code for you, for future you, and for your internal team will also likely make it easier for outsiders to see what you did. So a lot of these tips may not seem like, oh, these are good open science things, but they are things that are gonna help you do your work better, and if they help you do it better, they're gonna help other people see how you did the work in better ways. I think it's also important to remember that documentation, which is a lot of what this is, right, is really hard to do. It is worth it. And if you can devise different kinds of strategies to make that documentation process easier for you, great, that'll make it a little bit easier. The earlier in your process you start, the better; get good habits around it and that will help.

Okay, so first, data structures. If you're familiar with the tidyverse, this is maybe gonna be an obvious one for you, but if the tidyverse, if that word is new to you, great, welcome, we're gonna talk about it a little bit today. So there's this tidy data paradigm. If you go to tidyverse.org, that's where the main package information is. And the R for Data Science book, which is a really great book if you're trying to learn more about R, you can also access for free on that website, which is handy. Basically, the idea is that your data should be in a very specific kind of structure: every column represents one variable, and each row is one observation. And what counts as one observation might be different, right? You need to think carefully about what an observation is in your dataset. And then each cell is a single value, right? With data you get from someone else, like if you didn't collect the data yourself, you're doing secondary data analysis, someone else collected it, it may not be in that structure, and so you need to reshape it. The tidyverse is a series of packages, including dplyr and some others, that help you reshape your data in different ways, and that can be really helpful. A lot of people who use R use the tidyverse, and everyone kind of tries to get their data into this tidy data paradigm, and therefore it's a little bit easier to know what's going on. So this is another way that consistency across different people is gonna make it easier for other people to see what you're doing. You might also have other structures, like relational databases. R is getting a lot better about using databases directly within RStudio. I tried doing this a couple of years ago on a project and it was working okay, but it's gotten much better in the last year or two, and that's really exciting. There's a bunch of packages that help with database access directly within R, and I'll talk a little bit later about some other tools that will help with that as well.
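To make the tidy data idea concrete, here's a minimal sketch using a made-up gradebook; the column names and values are invented just to show the reshaping step with the tidyverse.

```r
library(tidyverse)

# A made-up "wide" gradebook: one row per student, one column per assessment.
scores_wide <- tribble(
  ~student_id, ~pretest, ~posttest,
  "s01",        62,       78,
  "s02",        70,       85
)

# Tidy version: each row is one observation (one student on one assessment),
# each column is one variable, and each cell holds a single value.
scores_tidy <- scores_wide %>%
  pivot_longer(
    cols      = c(pretest, posttest),
    names_to  = "assessment",
    values_to = "score"
  )

scores_tidy
```

The same pivoting functions (pivot_longer() and pivot_wider() in tidyr) are usually what you reach for when secondary data arrives in some other shape.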
And then you might need to do things like create subsets of your data, and that can help as well.

Okay, data dictionaries are also really important. Not just for understanding what's going on right now, but again, for future you. Let's say you submit a paper and then six months later you have to do some revisions and you're like, wait, what was this variable? What did it mean? How did this get coded, right? If you have everything really clearly laid out and documented in one place, and I'll tell you in a minute what that one place should be, then everything is gonna be clearer and easier for you and for the people you're trying to communicate with, right? So you wanna describe the meaning of each of your variables, plus other kinds of context and other information about the observations. You should do this even if you know what they mean, because maybe in the future you won't, or maybe someone else will be looking at it and they won't know. It's always good to document, for your future self, for your collaborators, and for other people, what's going on. And you can do this within an R Notebook, which, spoiler alert, is gonna be my main takeaway; I'll talk about that at the end of my section.

Okay, also think about the infrastructure of your data more generally. Like, where is your data? You should know where your data is. There are a lot of issues around data storage and data access; you should think about those early and often. Your IRB probably has opinions about where your data should be, and you should listen to them. Spoiler alert: Google Drive is not a good place for your data, generally speaking. So you should figure that out, because where your data is will impact how you access it when you're doing analysis, and especially if you have a lot of it, that can become an issue. So planning for data infrastructure, and how that relates to your code and how you're gonna do the analysis, is a really important consideration to make. I think it's sometimes overlooked.

Okay, version control is also important. I'm not gonna belabor this too much. There are some options here. Version control is one good way to keep track of what's going on and document the changes that are being made. You can use Git and GitHub, or SVN. I'm not gonna go into too many details, but I think Josh will talk more about Git later on, so that's good. But again, R Notebooks work really well with version control, so that's another good way to do things.

Okay, this one is one of my very particular things that I always try to get people to do very early on in projects: common file naming conventions. The more you can do this, the easier everything is gonna be afterwards. Not just in terms of doing your analysis, but also in terms of finding things and understanding what files are and what data is in different files. I usually work a lot with audio and video data, so there are a lot of different kinds of files over multiple days or whatever. The more information you can contain in the file name, and the more organized it is, the better it's gonna be for you to understand what's going on in different places.
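For instance, a made-up convention like site_date_teacher_group in every file name keeps things sortable, and R can pull those pieces back out later. The file names, the convention, and the column names here are invented just to show the idea.

```r
library(tidyverse)

# A hypothetical naming convention: site_YYYY-MM-DD_teacherID_groupID.csv
files <- c(
  "siteA_2021-03-02_teacher01_group07.csv",
  "siteA_2021-03-02_teacher01_group08.csv",
  "siteB_2021-03-09_teacher02_group03.csv"
)

# Split the file name into columns you can sort, filter, and join on later.
file_info <- tibble(file = files) %>%
  separate(file, into = c("site", "date", "teacher", "group"),
           sep = "_", remove = FALSE) %>%
  mutate(group = str_remove(group, "\\.csv$"))

# e.g., all the files for teacher01:
filter(file_info, teacher == "teacher01")
```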
That's the idea with writing code and functions for things: if the file name contains information, then you can use different kinds of functions to pull that out and add it into your data when you're trying to do analysis, and that can be very, very helpful, right? So again, documenting these conventions and making sure all of your collaborators adhere to them early on is really important. Keep it flexible, but it should also make sense, and think about how things will sort. For instance, in date formats, put the day last, because that's the smallest unit and it should go last, so that when you sort, things go in the right order, okay? And I think the other thing is to think about the unit of analysis that you're gonna care about later. So for instance, if you wanna do some analysis at the teacher level or at the class level or at a student level or a group level, put that in the file name so that you can then sort things based on it, or subset based on it: these are all the files that are teacher one, these are all the files that have group seven in them. Then it's really easy to keep things organized and really clear about what's going on. And the clearer it can be, the better: it's good for you, it's good for your collaborators, it's good for any future person that might need to look at it. So that's really important.

Okay, functions. I know if you're new to R, functions are kind of scary sometimes, but you should ask yourself: are you copying and pasting your code a lot? If you are, you should probably write a function. Functions can really speed up the process, they can also provide more consistency, and you can share them with others once you've figured out how to do something, and that's really nice. So functions are your friend. The R for Data Science book has a really nice section on functions, so if they're scary to you, don't worry, there are a lot of resources out there that can help.

Okay. And then, visualizing your data is really important. I teach an entire class on this using ggplot, so I can't cover it very well in one minute, but I wanna really point out that it's an important part of your process and not just an end product, especially with complex data. You have to look at your data. As you probably know, descriptive statistics can be very misleading; in this classic example, Anscombe's Quartet, all four of these plots have the exact same descriptive statistics, right? But they clearly look very different, and you would do different kinds of analyses on them based on what they look like, right? So you have to look at your data, and the more complex your data is, the more complicated this is gonna be. So it's just an important part of the whole thing. Histograms can be really helpful, and they're really easy to do in ggplot (not easy to do in Excel, fun fact), so that's a good thing to do. Facets are another really big thing. I don't have time to talk about all of my tips related to visualization, but what I will do is put my website in the chat. I have a series of blog posts on R there; some are about how to learn R, how to get started with R, especially if you are used to using SPSS or Stata or something else and you need to switch over. I got you, don't worry. I was a big Excel person before I moved over to R.
So some of those posts are about why you should use R and what the basics are. And then there's one on dplyr and data manipulation, another one on data visualization, and there will eventually be one on Shiny, so look forward to that. Okay, all right.

So what I want to spend the next couple of minutes talking about are R Markdown and R Notebooks, because for me these have been a game changer. When those came out a few years ago in RStudio, it really made it a lot easier to do a lot of these documentation things. It's really, really improved. And so if you're someone who started using R a while ago but didn't really see the benefit of R Notebooks or R Markdown, and you're still doing R scripts, that's fine, but I'm going to try to convince you right now to switch to using R Notebooks exclusively. Because for me, they've really made a huge difference in pretty much every aspect of my work, and it's been really important. So one thing is that it is a Markdown file. It's plain text, so it works really well with version control, which is great. Also, plain text is just better in almost all cases. One of the key things that I think is really important, and I'll show you this in a minute, is that it produces an HTML version of your file, and that is accessible via a browser by anyone. So even if you have a collaborator or someone who doesn't have R installed and doesn't know anything about R, they can still view everything and see your code and see your graphs and see all the text you've written. It's a really easy way to share what you're doing with other people, and that's a really key part of it. One of the nice things about R Notebooks is you can mix together any explanatory text that you have, your code, and your figures. It's all in one place, all in a coherent narrative, and that's really important. Also, importantly, it's not limited to R. You can execute code in Python, SQL, JavaScript, Bash, and a bunch of other languages, and they work pretty well. So that's great. There are a number of important features. One is persistent code execution, which is really nice. You can see multiple plots at one time, which for me is really, really important. There are interactive tables. There's formatting, which is nice. And there are lots of different output options. Some of those output options are different kinds of documents: you can make interactive R Notebooks, you can do PDFs, Word documents, rich text, et cetera. You can make slides; these slides were not made that way, but you could do that. You can even make dashboards directly in an R Notebook instead of doing it through Shiny or something. You can make a GitHub document. You can write a book using bookdown in there. And it will also make websites, since it's HTML based. It's using Pandoc on the back end, if that is helpful to anyone.

Okay, so what I'm gonna do now is share my screen, the whole thing, and show you an example. Okay, so this is RStudio. This example R Markdown file is on the OSF repository, if you wanna go there. So if you wanna start a new R Notebook, you can go up here, and instead of doing an R script, do an R Notebook. And it will look like this. This part up here at the top is very important, so don't delete that. You can delete everything else, but this part at the top is actually what makes it an R Notebook instead of other things.
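Roughly, a brand-new notebook starts out looking something like this (the title here is just a placeholder); the YAML header between the dashes at the top is the part you shouldn't delete.

````markdown
---
title: "Example analysis notebook"
output: html_notebook
---

Any explanatory text can go here, written in Markdown.

```{r}
# A code chunk: run it and the output stays right below it in the document.
summary(cars)  # `cars` is a small data set built into R
```
````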
There are different options here; I don't have time to go into all of them. But if you go to rmarkdown.rstudio.com, and I'll just put that in the chat, that has all of the different options. I wish I could tell you about all of them, but we don't have time. Okay, so this example is trying to show you basically what happens: you can type regular text, you can use Markdown, you can put links in. Then you can put in what's called a chunk. So this is an R chunk; it says r, this is an R chunk. You can execute the code and it will run the code, which is really nice. And then you can write something else. So you do a thing and you can write about it right there, all in the same place. And then you can run, let's say, make a graph. The graph shows up here. Instead of the graph going over into your little plots pane over here and hiding, and then going away when you make another plot, this is gonna be here; I can scroll up and down and it will just be there the whole time. I can make another graph right below it and keep looking and comparing them, whatever. It's really nice that it's always there, and it's really, really helpful. So Markdown, if you're not familiar with it, some of the basics are that you can do all sorts of different things very cleanly, like headers. R Markdown has a lot of different options, including things like a table of contents. Here's an example of a table, so you can see you can kind of scroll through your data within the R Notebook and see all the different things; it shows you your variables, their types, et cetera, all those different things. Here's an example of a data dictionary kind of thing: you can just type it directly in the R Notebook right after your data, and it shows up right there with everything else. And then the other thing, before I switch over to Josh: it has inline code execution, which is really nice. So let's say you wanted to say, oh, the mean of this variable is this, blah, blah, blah. What happens is, when you preview the document, here we go, you can see all the Markdown has now rendered properly, here's the link, and you can see all these things, and look, the numbers are here, just like that. So instead of trying to copy down what the analysis result is, you can just have the code directly in your text, and it will update automatically if anything changes with your data or anything else. So there's an example of what this would look like. Again, this is an HTML file, so you can just send it to someone else and they can just open it and see all these things. There are a lot of options in terms of hiding your code or showing your code and things like that. But you can do this and all sorts of other fun formatting options. Again, this example is up on the repository for you to look at. Okay, I think I need to switch over to Josh. So if you have any questions, put them in the question box and we will talk about them later.

All right. Great, thank you, Cynthia. Can you hear me okay? Great, okay. So in the second half of the webinar, I will share an example of addressing a couple of different open research practice related challenges through a set of tools including OSF and R. At the end we'll have time for questions too. So I'll try to present in about 15 to 20 minutes and leave plenty of time for your questions at the end.
We'd love for this to involve a bit of a discussion among everyone here. On that note, I saw a lot of familiar names in the participants; thank you all for taking the time to join, including a student, Omiya Sultana, who helped create the data that will be used in the example that I'll share. Okay. So I'm gonna share my presentation. Can you all see a title slide? I think you need to do the plain window and then share it. So maybe I'll share and then redo it. Still showing that. How about now? That looks good. Thank you. Okay.

So I'll focus on accessing data on OSF through R using the osfr package. But first, a little bit of background. I'm coming to open science in response to some problems, or opportunities, that I've come across first as a graduate student and now as an assistant professor. Some of these are technical; some of them are more community and access related. I'm gonna focus on two for now: the technical issues, and the more community- or social-related problems or opportunities.

First, one technical challenge I faced was reproducing my own work. It's pretty common for us to submit a manuscript including results to a conference or to a journal. And when we come back to that analysis three months later or six months later or a year later, I found it surprising and really humbling how infrequently I was able to obtain the exact same results when I was trying to make a small change to the manuscript and wanted to make sure I could rerun everything first. So I wanted to mention two examples from my work. One was for an article where I felt like I wasn't so successful. I was trying to be able to reproduce everything from the very first data processing step through the results, and I couldn't do it. There were just small differences that completely vexed me. In the end I ended up sharing some analytic code, but it wasn't fully reproducible: the code was more an example of what was done in the analysis that others could modify, but it wasn't possible to strictly reproduce that analysis. In a different paper, this Computers & Education study, it took a little more time and I learned some lessons, but I was a little bit more successful at reproducing my own work. And through this link here, there's a repository with code and a way to request access to the data used in the study, so that anybody could reproduce the findings reported in that manuscript.

There are also some more community-related or social opportunities I've come across. This stems from working in public and seeing some benefits from working in public. I learned a lot from a faculty member at Michigan State University, Leigh Wolf, now at Arizona State, who encouraged her students to share publicly what they were working on. For instance, after Leigh gave a conference presentation she would put slides up on her website, and this planted seeds that really influenced my thinking about what it means to work in public. So one example that was not so successful was my dissertation. I wrote my dissertation using bookdown, an R package that lets you create a book that includes plain text as well as R code. And it was successful in one way, but it wasn't in others. I don't think anybody really read it. I know nobody else contributed to that code.
It was sort of a good exercise, but then years later, with colleagues, I wrote a book using that same R package, and it was thrilling in that about a dozen people made contributions, from things as small as correcting typos in the in-progress manuscript to suggesting what topics to include in it. And so here an earlier attempt led into a later attempt that really showed me some of the benefits of working in public, but it took some time first.

So some commonalities in my work, and maybe in some of yours in education, are that I'm often interested in protecting private data. Not always, but a lot of the data that I analyze is data that I'm not able to share, and moreover, data that I'm not really able to easily anonymize. In such cases I'm not going to try to share the data along with the manuscript necessarily. Now, that's not to say I won't have a mechanism through which other researchers could request access to the data, but I wanna protect that process and be really thoughtful about asking anyone who uses it how they'll use it and how they'll protect the privacy of the people, often students, from whom this data was created. I'm also interested in using code to reproduce analyses, for myself and then for others as well, including collaborators. I'm interested in having a platform for sharing products like documentation and preprints and, potentially, postprints. And then finally, I'm interested in being able to link between these three dimensions of research projects: protecting data, using code to reproduce analyses, and having some way to share products from the project.

And so a pretty good practice, let's say this humbly, maybe not a best practice but a pretty good one for the specific set of opportunities and challenges that I've encountered in my work, involves this set of tools. These tools are: Git and GitHub for sharing code publicly in a way that promotes collaboration and transparency; the Open Science Framework for sharing data, even large data sets, either publicly or privately, as well as for sharing products, more intermediate products like reports that might be shared among the research team, and also pre- and postprints; and then two R packages used strategically to link between these different dimensions of research projects. Those are targets, which is an R package for creating workflows that run all of your analysis, from the first step where you load your data, or whatever the first step is in your analysis, through creating something like a report or even a manuscript; and then osfr, for linking from R to OSF in an automatic way. And I chose that word somewhat boldly. It's not really automatic, but maybe "in a way that's not manual" is what I really wanted to highlight by selecting that word here.

Just a few words on why these tools. Git and GitHub, as Cynthia mentioned, are fantastic for sharing code and for using a version control system for files such as R Markdown and R Notebook documents. However, they're really not well suited to sharing data, especially larger data sets, for at least two reasons. One reason is that if you ever share data that you don't wanna share publicly, it's really a sticky problem to try to remove it from the repository, for the same reason that makes Git and GitHub so fantastic for sharing code: it's always in the history.
And so, strictly speaking, if you included a private set of data in a GitHub repository, and then you quickly removed it and went on making changes over the next few months, that data is still stored in there in such a way that somebody could obtain it if they tried really hard. It's still there. So it's not great for sharing data for that reason. The second reason is that GitHub has a fairly strict limit on the size of files. In some ways it's pretty large, I think it's around 50 megabytes, but I've been surprised by how often data sets quickly grow beyond working well with that limit. And so OSF works really well for sharing large data sets, even hundreds of megabytes. It's not, though, very good for collaborating around code; even though you can link GitHub into a project, it doesn't have the same features that Git and GitHub have, features that have been developed over a long period of time based on the kinds of problems that you have when you and others are working together to write something like R code.

Okay, so I'm gonna dive into an example next, and I'll chat this link as well. All right, is my screen visible, a browser window? Okay, great. So this is an OSF repository for this example, and I'll chat this link. This OSF repository is primarily here for one purpose, and that's to share, in OSF storage, two files: NGSS adoption states and NGSS adoption survey. A little bit more on those in just a second. There are also links here at the very top, first of all to a GitHub repository, and this includes the code associated with this example, as well as to the repository that has our slides and a few other materials that we'll share with you after. So for now, I'm gonna focus on this GitHub repository. This is a repository that uses those two R packages. To be as focused on the use of this as possible: this repository could be used to reproduce this analysis, and there are three steps for doing that. The first is to access it, so you have to download this repository like this, or you could use a tool that works well with GitHub for doing that; I like GitKraken, and GitHub has a tool as well, GitHub Desktop, I believe. Next is to open the R project in this repository; I'm gonna move out of this window for a moment and show how I would do that. And then third, running the analysis with a single R function, targets::tar_make().

Okay, actually, before I say any more, I'm gonna hop over to RStudio. So here's my RStudio window, just confirming, can you still see it? The RStudio window, if you don't mind; sorry, Cynthia, for just checking. It's a little small, but yeah, it's good. It's a little small. The text is a little small, yeah. Okay, let me see, thank you. You can do command-plus. Is that a little bit larger? Yeah, that's good. Thank you. Lessons learned from teaching during a pandemic here. Am I sharing the same screen? And so I'm highlighting three files that are part of this GitHub repository. Those three files are: what's called the targets plan; a set of R functions that I wrote, and you can see these are pretty simple functions here, simply reading in the data, joining together the two different data files, doing a little bit of processing, and then preparing the data for a plot; and this file here, which is an R Markdown report, similar to the R Notebook that Cynthia just shared.
And it's a really simple report at this point, but it dives a little bit into the specifics of what I'm trying to do here. So this is just plain text I wrote: teachers in states that have neither adapted nor adopted the Next Generation Science Standards report that their school is teaching the NGSS to a high degree, which was surprising, because these are states that haven't adopted those standards. And states that adopted standards based on the NGSS still report teaching the NGSS. And this is code that demonstrates that. So you could also imagine this text coming after the code, if you're presenting this more as an early research finding where you're interpreting the output.

So this plan here does a few things. It specifies what R packages you need to load, and then it details a number of steps that walk through exactly what's being done to get from these data files through to this report with this plot. And so, to run it: live coding is always a little bit frightening, but I'm going to run targets::tar_make(), just that one function that's mentioned over here in the repository. And there's one more file I wanna mention real quick; this is kind of key. At the top of this file, there's a line of R code that says, hey, run osf.R. What that is doing is heading over to that OSF repository that we looked at a second ago and downloading those two files from the Open Science Framework. And it's a little bit more convoluted because it's basically telling R, hey, if this file already exists, don't download it again. So I'm gonna go back here and run targets::tar_make(), and I'm gonna cross my fingers. Okay, an error did happen. When I teach a class on R, I welcome errors, because they're a chance to walk through taking a deep breath and seeing what went wrong. I think it's actually this stray comma here. R is so picky. That wasn't intentional; that was pretty anxiety producing. And so what R is telling us here is that it's running these different targets. I'm gonna move up a little bit. It just said it finished. At the very beginning, it's saying it ran the target joined data; it's processing the data and then creating this report. I think it didn't run the steps to download the data because I already had it downloaded. But if I delete this data directory (oh my gosh, this is treacherous) and run this again, it might take a moment just to download that data. But yeah, okay, this time it ran those steps to access the data, ran the same steps again, and then created the report.

So what are we doing here? What does this all mean? If we look in the docs folder, we can open this up in a web browser. This is just the contents from the readme, but more importantly, here are the results of the report, and this is that plot. From this plot, we can see that in Mississippi a lower percentage of teachers report that their school is teaching the NGSS. I notice here that there are some error bars missing; it looks like in the raw data there weren't margins of error reported for a few states, states like Rhode Island, where a high percentage of respondents reported their schools are teaching the NGSS. So that's not too important. What's important is that all of this, including this report and even this website, which was produced here, was created with a single R function that involved accessing the data, running some data processing steps, and creating this report. There are also a few pieces here that are helpful for reproducibility.
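To give a flavor of what that kind of plan and OSF download step can look like, here is a stripped-down sketch. To be clear, this is not the actual plan from the example repository: the OSF node ID, the file names, and the helper functions read_and_join() and render_report() are all made-up placeholders.

```r
# _targets.R : a minimal sketch of a targets plan that pulls data from OSF.
library(targets)
tar_option_set(packages = c("tidyverse", "osfr"))

source("R/functions.R")  # hypothetical file holding read_and_join() and render_report()

# Download the raw data from OSF only if it isn't already on disk.
get_osf_data <- function(node_id = "abcde", path = "data") {
  local_file <- file.path(path, "survey.csv")
  if (!file.exists(local_file)) {
    dir.create(path, showWarnings = FALSE)
    node  <- osfr::osf_retrieve_node(node_id)  # the OSF project/component
    files <- osfr::osf_ls_files(node)          # list the files stored there
    osfr::osf_download(files, path = path)     # pull them down locally
  }
  local_file
}

list(
  tar_target(raw_file, get_osf_data(), format = "file"),
  tar_target(joined_data, read_and_join(raw_file)),
  tar_target(report, render_report(joined_data))
  # At the very end you could also push a rendered report back up to OSF, e.g.:
  # osfr::osf_upload(osfr::osf_retrieve_node("abcde"), path = "docs/report.pdf")
)
```

You would then run the whole pipeline with targets::tar_make(), and on later runs only the targets whose upstream pieces changed get rebuilt.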
This dependency graph is from targets, and it tells us how the different parts of the plan work together. So the state data and the survey data come together into the joined data, which is processed here in this target called join data proc. It's parsed a little bit and then prepped, put together into this data frame that's used for three different reports: the PDF version, the HTML version we're using here, and then a third document, sorry, this one here, that includes some session info. And so this could be helpful if we wanna see what version of R was used during this analysis. It looks like 4.0.3; I think there's a more recent version out there.

Okay, so one last thing I wanna mention is that osfr lets you work both ways. So if I add that comma back and uncomment this line here, this last line will upload the PDF report, which is in our docs folder right here, to OSF. And if I rerun the analysis, the only parts that should rerun are the parts that depend on something that changed, so those earlier steps are all skipped, because they already ran successfully and nothing changed that should change that code. But I set it so that these reports regenerate every time, just because if you only change the text in them, they otherwise won't re-render. And so if we look now over in OSF, this was already uploaded, but this would upload the latest report, or I think it should actually upload a new version of it. Maybe because it was identical, it didn't. But this would be a way to load the final product back up to OSF at the very end.

So I'm gonna share some slides again and step back a little bit from the specifics. How does this help? What does this all mean? First of all, this allows for the possibility of protecting private data. You could imagine a case where you wanna share the code publicly, but you cannot and do not wanna share the private data. And so you could set your project in OSF to private and add collaborators to that project, so that they could use the code to access the data, but other researchers could only access the code, and they would have to either use their own data or request access from you to be able to reproduce the analysis. Second, it uses code, and shares it, in a way that can efficiently reproduce an analysis. As Cynthia mentioned, this is often really helpful just for ourselves: I could come back to this in three months, or a collaborator could come back to it, and make a change to this set of steps, and anything that needs updating will update, and other parts will not. And then it provides a platform for sharing interim products and final products. I didn't mention that the website that I viewed in my browser is also hosted through GitHub Pages, and I found that to be really helpful for internal use. When I'm working with a team on a project together, we'll often update the analysis and look at the same website: are you seeing these same results? I am, just go to this URL here. We could also view that HTML file; it's easy to view in your browser through RStudio by clicking on it and clicking View in Browser. And if you're in Finder on a Mac, or if you're on Windows or a different operating system, you can usually just drag that file up into the window of your browser to view it.

So some takeaways from me, before I step back to takeaways from both of our presentations: our research practices can and should evolve over time. It's okay to try things out and have things not work the first time. I've done that plenty.
There are many ways this approach can and should be modified and extended. Just a few thoughts about the next ways this work might evolve. First of all, in this example I set that repository to public, and so it's a simple case; I'd like to think about ways to make the process of accessing private data a little bit more seamless. Could there be a tool built into OSF, for instance, where researchers could fill out a brief application, including questions about how they would protect the privacy or the confidentiality of that data? So something I'm thinking ahead about is how to make that step as fluid as some of these other steps of the process. And then, using a tool to access particular versions of packages. As you saw, I included information in that report about what version of R I was using and also what versions of packages. But there are tools in R now that make it possible to include the packages that your analysis depends on along with the analysis itself. I haven't used those before, and I'm interested, because if we really want something to be reproducible, maybe the packages that we use also have to be a certain version, because those change over time. So I'm curious about how to improve that part of my work. I think thinking about open science in STEM and science ed means carefully thinking about what we care about in STEM and science education. Not everything that open science scholars in other fields do is relevant, and we are likely, I think, to develop new practices specific to our problems and the opportunities that we face. I have a little plug for a preprint with Rasheda Likely and Erin Kester where we try to make an argument like that.

So here are some parting thoughts that step back from the last portion of this presentation. Cynthia, would you like to summarize these with me? Okay, so open research practices ought to emerge from the challenges we face and the opportunities we have. We have some tips for doing that, and there are some examples that we're trying out, but we invite everybody here to develop field-specific and even scholar-specific open research practices that are appropriate for your work. This is not a global set of things that we all can and should be doing with respect to open science. Starting small is often the best way to begin. We have an inquisitive penguin who invites you to discuss anything that we talked about, and more, for the remainder of this time. Thank you.

All right, I think we have time for some questions. Oh, great. Good question: what is the difference between R Markdown and R Notebooks? All right, hold on. You may not want to know the answer to this. No, it's okay. All right, hold on, I'm just pulling up a website. Okay, so. Okay, the short answer is that, so, I don't know, Markdown is maybe not something you're familiar with. Markdown is a lightweight markup language that gets converted to HTML. It's been around for a long time; John Gruber and some other folks started it a while back, and there are a lot of different flavors of Markdown, like Markdown Extra and some other things. R Markdown is basically a version of Markdown that has additional features that are R-specific. So, let me share my screen for RStudio. Things like writing this kind of chunk, with this r and then this R code in it, that's a specific R Markdown thing. Whereas things like this meaning bold, that's regular Markdown.
These headers are regular Markdown. These are all regular Markdown. Links are regular Markdown. But then R Markdown has some special features, and the chunks obviously are part of that. So basically R Markdown is a version of Markdown that you can use in lots of different ways, and R Notebooks are a specific format. An R Markdown file can have different kinds of things in it. And so what I also want to show you is the website; rmarkdown.rstudio.com has lots of great things, and it will tell you a lot about the different options for different kinds of R Markdown generated, or powered, documents. An R Notebook is just one of those things; a notebook is technically an interactive R Markdown document. But all these different things, like the HTML file and the PDF version of it, they all use R Markdown. So R Markdown you can think of as the language that you're using; it's a markup language to write these things. And then an R Notebook is one of the output formats. So that's the short answer. If you want to learn more about that, definitely spend some time there; if you go to the Get Started section, it has a really nice overview of how this works and the different parts of it, including what specific benefits the interactive documents have that the other ones don't. And then the output formats section can tell you a little bit more about what the different ones will look like.

Okay, next question: can you publish a report in a format that can be directly submitted? Greg, go ahead. Hi, Dan. So there are ways to do that. I think it depends on what the template is. There are lots of ways to write CSS kinds of styles; again, it's an HTML thing. If you're familiar with Pandoc or any other kinds of translational tools for doing things like that with LaTeX or something else, if it's possible to do it using those tools, then you can do it through this. And there are some packages that'll make that easier for certain kinds of output formats, like rticles. Okay, cool. Yeah, there are some journal templates that are available this way. I haven't personally tried that, because, yeah, it's a lot. But it is theoretically possible for sure, and you can definitely make any kinds of changes you want to those.

Do you mind if I add onto that real quick? Yeah, please. I just spent hours updating a manuscript with numbers because we added a few additional months of data to a paper. And if we had written it using an R Markdown document, we could have had those numbers update automatically, just like those in-text numbers that Cynthia showed in her report updated. Like a lot of things, it's probably an investment upfront in multiple ways: an investment in general, just for learning the new language of that particular package or template, and then also an investment for each project. But like Cynthia, I'm interested in it. I have a project with JooYoung Seo, who's a doc student at Penn State, to create a template for International Society of the Learning Sciences conference presentations. So we're working on that, and it's just one small part of the possible publications that you might be interested in submitting. But yeah, I think it's a great question. Love to explore that more ourselves. Also, hi, Dan.
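As a tiny illustration of those self-updating numbers: in an R Markdown document you can write a sentence like the one below, and the values are recomputed every time the document renders. The scores data frame here is made up; it would be whatever object holds your data.

```markdown
The mean posttest score was `r round(mean(scores$posttest), 1)` points
(n = `r nrow(scores)`), so these numbers stay in sync with the data.
```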
Josh, do you wanna take this question about GitLab? Do you know anything about that? Yes, I'm happy to take it. I'm not familiar with GitLab. Sorry, Eric, when I saw your question over in the chat I read it as GitHub, and now I see it's GitLab. Would you be willing to share anything in the chat, Eric, just for us and others to follow up on? I will say I use a GitHub organization for my research group, and it is helpful to have the code for our projects in one place. It takes a lot of discipline on my part and everyone's part to share code there and to update those projects in a way others can use.

Okay, there's a question about Shiny. I don't wanna get into too many things about Shiny. You can create a Shiny-based flexdashboard within an R Notebook. I personally still prefer to do a Shiny app as a standalone app using regular R scripts, but you can do it that way. I do have some examples I could show you, but I feel like that's gonna go down a long rabbit hole that is not gonna be generally helpful, so if you wanna send me an email, I'd be happy to talk more about Shiny. In some ways Shiny is a completely different topic, but it is possible to do some Shiny-related things within regular R Notebooks; the flexdashboard package especially can be helpful for making something simple. That's if you need a dashboard; if you don't need a dashboard, that's a different problem. I would say I don't do that, I do them separately. But yeah, good question. Again, I will eventually be writing another blog post about Shiny, but that's, I think, a different topic.

What's a good way to parse through the many packages available, say, ones good for open science, for a beginner? Do you have R packages you'd recommend? Oh, okay, yeah. So my suggestion for new people is to not do too many things with too many different packages. One of the nice things about the tidyverse that we talked about is that it's a set of packages that are internally consistent with each other. There's a lot of documentation around them; they're really, really great. You can basically do, well, depending on what your workflow is, but generally most of the typical things you would do with data will be covered by one of those tidyverse packages. And so, starting out, one of the easier things to do is to just pick one kind of paradigm and work with that. If you're trying to do too much with base R and weird packages some random person made and other things like that, it can be really overwhelming, and so my suggestion would be to try not to do that too much when you're starting out. And again, also try to start in pieces, maybe. So when I switched over to R from mostly using Excel, actually, I did it in pieces. It was like, I feel comfortable doing these things in R, I still want to do these other things in Excel, and then over time my workflow switched to completely R. So I think one thing to remember is that there is a steep learning curve with R, especially if you're not used to scripting or programming languages or things like that. So just find the first thing that makes sense to you and start there, start small, and build it over time, and have it make sense for you. The best way to learn it is gonna be on something that means something to your workflow and helps you.
And so try not to overwhelm yourself, because it is overwhelming, and just start with a small thing to start with. Yeah, great advice. One plug for rweekly.org; it's a great way to keep up with new packages. But in some ways that's counter to what Cynthia, I think, rightly recommended, which is to start with things that solve problems that you have, start with the tidyverse or at least one set of tools that work well together. Yeah, and there are a lot of people online that are learning R as well, and a lot of good resources. The #rstats hashtag is good if you need help. The other thing, too, is that if you're not used to programming, Google is your friend. Don't feel like real programmers don't Google things, because that is 100% not true. Stack Overflow has lots of good help. So if you're getting stuck on something, you should 100% Google it and see if someone else has figured it out. And this idea of "real programmers," whatever, it doesn't mean anything; people are constantly doing that. There's absolutely no expectation that you should memorize all of the code or packages or anything like that. You should find help when you need it, as frequently as you need, and that is 100% totally okay.

Good question about what a workflow looks like when collaborating on code with others. This is where GitHub can be helpful, although it's a substantial investment, because you can make changes to a file such as an R Markdown or an R Notebook file, you can then update that file on GitHub, and then your collaborator can access that most recent updated version and update it themselves. And GitHub has tools for dealing with those unfortunate situations where you might have edited the same content at the same time and so you have to resolve a conflict. So that's one way, but there are also other ways that are maybe a little bit more dependent on your communication about who's working on the file at what time, and you can simply share those files in a cloud storage system such as Google Drive or OneDrive or Dropbox. Yeah, the other nice thing about R Markdown files and the HTML files they produce is that they're small files; you can easily email them to people. Again, if it's too overwhelming to set up some kind of version control or other system, you can just make your document and email the HTML file to someone else, and they can open it in a browser on their computer and look at it. That's one of the really nice things about it, I think. So there are a lot of options. Any way you could share an HTML file, or any kind of plain text file, also works. So the sky is the limit.

Can you use Python in RStudio? Yeah, so one of the chunks, actually, I'll show it really quick. So when you are adding a chunk here, you can add a chunk by going up here, and you can add an R chunk, you can add a Bash chunk, or you can add Python. And if you add Python, you can see here the little Python label is there, and you just type your Python code, and assuming you have Python installed, this will work. The other thing to mention since we're here: there are a lot of settings for the chunks. You can name your chunks, which can be helpful for some things. You can change the output settings, so it shows just the output, or the code and the output, or nothing, or you can have the chunk run or not run.
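For instance, a chunk header can carry a name and options like these, and a Python chunk works the same way. The chunk name, the options, and the code here are just arbitrary examples, not anything from the webinar's files.

````markdown
```{r example-histogram, echo=FALSE, fig.height=4}
# A named R chunk: echo=FALSE hides the code in the rendered output,
# and fig.height sets the figure height (in inches).
hist(rnorm(100))
```

```{python}
# A Python chunk in the same document (needs Python installed; RStudio
# runs it through the reticulate package behind the scenes).
print("hello from Python")
```
````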
And there are a lot of other options and things like that. So if you have figures, you can set the figure height, and you can either go through this settings menu or type your options right in here as well, like, I want the figure height to always be four inches or whatever, things like that. Python: a lot of people know Python, and yeah, you can totally use Python. You can do Python and then R and then Python and then R, whatever makes sense for you; that works.

All right, I think we're out of time, but I'm happy to stay on for a couple of extra minutes if people have more questions. We do. All right, thanks guys. So thank you for coming. Yeah, go ahead, Marcy. I was just gonna thank you. Huge thanks to Cynthia and Josh. This was informative and it was detailed. You can tell by all the questions, it was super helpful. We really appreciate it. Everyone feel free to follow up with COS with any questions, or with Cynthia and Josh. The discussion doesn't stop here, but we thank you for joining us. Have a great day. Thank you, Marcy. Thank you, Claren. Thanks for the opportunity. Thank you so much.