Thanks. So hi, I'm Hao Ye. My pronouns are he/him, and I am currently the reproducibility librarian at the University of Florida. I'm here today to talk to you about a project I worked on in my former job as a postdoctoral researcher, on improving accessibility and reproducibility in ecological time series analysis. Although this conference is virtual, I still want to acknowledge that the University of Florida occupies land that is the territory of many nations, including the Seminole and Timucua peoples. Furthermore, the project I'm going to talk to you about involves collections of ecological data. We haven't gone through and looked at where all this data was sourced from, but it most likely includes at least a few examples of what has now been termed parachute research, which describes scientists who go into a community and collect data about or within the community in a way that benefits the academic in their career, without always involving the community or trying to understand how the community's interests might align with the research.

To give you some context for this talk: like I said, in my previous life, less than two weeks ago, I was a computational ecologist. My training is in time series analysis, developing software packages in the programming language R, and research on dynamical systems and chaos theory. The project I'm going to talk to you about we call MATSS, which stands for the Macroecological Analysis of Time Series Structure. It's a software package that currently exists on GitHub and has a nicely rendered website built with pkgdown. The MATSS team involves several members of the lab, including both of the PIs, Ethan White and Morgan Ernest, as well as grad students Renata Diaz and Ellen Bledsoe, our lab manager Glenda Yenni, and statistician Juniper Simonis. So we have a lot of different people in this lab, which we call Weecology, at the University of Florida, with expertise across different domains, including software development, statistics, field ecology, and population and community ecology.

In general, our lab is interested in lots of big ecological questions. Some of them include: How accurately can we predict population changes? How far into the future can we make those predictions? What ecosystem properties are associated with, or may be drivers of, those changes in abundance? What kinds of changes occur in whole ecological communities? And are those changes in the community merely the sum of individual population dynamics, or are interactions between different populations in an ecosystem important for directing that community-level change? I should probably also mention that at this exact same time there is a virtual meeting of the Ecological Forecasting Initiative, so our team members are split: some of them, I think, might be in the audience for this talk, and some of them might be in that virtual meeting. To summarize our perspective on research and what we're interested in doing: we basically want to do all the analyses on all of the time series. And in order to really approach that kind of work, we have to focus on data analysis in a systematic way.
We do this by gathering open ecological datasets, tidying them, and ensuring that there's consistent metadata about things like location and the taxonomic ID of species and populations. We use diverse modeling approaches: population dynamics models, as well as machine learning and statistical models. And then we also write software for reproducible analyses. Again, this is achievable because in the Weecology lab we have a lot of folks with diverse backgrounds and a depth of skill and experience. Among the seven members of this project, including myself, five of us have PhDs. And I have to admit that this is not actually the typical situation for an ecologist or an ecological researcher.

So, for example, suppose you are an ecologist and you are interested in testing some of your hypotheses about ecological populations and how they might be changing in time. You know there's a lot of open data out there that might be used. And certainly this is the experience many are going through now, given that COVID is restricting field seasons and other kinds of research activities; a lot of folks are more interested in doing computational research. So you have an idea, and then you browse Twitter, and based on the recommendations of the #rstats community, you decide that this nice book, R for Data Science, is going to get you started. Right in the middle of chapter one, you see a nice workflow diagram that describes how you're going to accomplish your analysis. And that seems pretty straightforward. Of course, it is a workflow diagram: a flow chart that is intentionally a simplistic model of how to approach data analysis. Once you start getting into the weeds of doing the research, you realize that there are actually a lot of skills involved. You have to figure out how to obtain and import the data. If you're getting data from different sources, you often have to figure out how to get them consistently formatted: you might have to change how dates are structured in the data, or there might be errors in the identification of species that you need to fix. Then you have to come up with a statistical model that addresses your research question, and learn programming to code it up. And after you've gone through all of these steps, you might then learn about this notion called reproducibility, and all of this literature about how research is not reproducible, and these best practices out there that you have to follow.

These are all large, daunting challenges if you are trained as an ecologist, but not as a statistician or a data scientist or a software developer or a philosopher of how scientific knowledge gets generated. So as a team, we thought about how we can build a project to reduce some of these barriers to reproducible data analysis, focusing specifically on the area where we have expertise, which is working with ecological time series. And what we came up with, MATSS, is cyberinfrastructure that has functionality to accomplish all of those steps in the data analysis and make them as easy as possible for researchers. For example, it has functionality to allow you to obtain ecological time series data.
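To make that data-wrangling step concrete, here is a minimal sketch, in the tidyverse style the talk refers to, of the kind of cleaning just described. This is purely illustrative, not MATSS code: the input file, its columns, and the specific misspelling are all hypothetical.

```r
# Illustrative cleaning step, not MATSS code: parse inconsistently formatted
# dates into a single Date column and fix a known species misspelling.
# "surveys_raw.csv" and its columns are hypothetical.
library(dplyr)
library(lubridate)

raw <- readr::read_csv("surveys_raw.csv")

clean <- raw %>%
  mutate(
    # dates were recorded in several formats; try each in turn
    date = as_date(parse_date_time(date, orders = c("ymd", "mdy", "dmy"))),
    # correct a known misspelling of a species identifier
    species = recode(species, "Dipodomys merriarni" = "Dipodomys merriami")
  ) %>%
  arrange(date)
```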
We have written code that transforms all of that data into a common format, and template workflows so that you can take an analysis on one dataset and basically repeat it on all of the datasets we make available, so you can conduct that large-scale comparison. And finally, we have functionality for generating reproducible reports and sharing your code with other researchers. The way that works at the software level: we build off of an existing data collections manager called the Data Retriever, which obtains datasets from different places. We build off of the tidyverse set of R packages for transforming datasets into a common format and dealing with irregular or missing samples in the data. We use the drake workflow package to help organize all of the analyses, so that when you are applying your statistical model to a lot of time series, it organizes all of those different sets of results in a coherent way. And then we build off of templates like those in the usethis package to provide a research compendium template that enhances reproducibility and makes it really easy to share your work.

Some of the datasets we have linked to with our project include the North American Breeding Bird Survey, the Global Population Dynamics Database, and the BioTIME database, as well as ten individually curated datasets, totaling over 300,000 time series. I don't actually know the number of data points, given that each time series has a different number of data points depending on the length of the sampling. We provide this all in a standardized data and metadata format, so that when you decide how you're going to do your analysis, you only have to make it work for one dataset, and then it can be applied to all of the datasets we give you access to.

Furthermore, the most exciting part for me is trying to come up with a way to promote reproducible workflows. Having had many years of experience teaching in R user groups and workshops for The Carpentries, I know there's a lot of information about being an effective teacher and employing good pedagogy to teach programming and data science skills. But one of the principles I've synthesized through all that experience is that providing good user defaults is essential for getting people started. You can provide a lot of teaching and workshops and readings and resources for learners, but if they don't have a default computational workflow, then when the students try to actually do the research, they end up coming back and needing more help to get started. So we try to do this by using drake, the workflow package, to organize code, and then we provide a research compendium to package all of that up in a research project. I know a lot of you may not be familiar with what a research compendium is, but it's basically a bundle of code and data, together in a way that makes it easy to share and reproduce the results. And the way we have constructed this for our project is to actually extend the functionality of R packages. If you've ever worked in the R programming language, you know that there are all of these R packages out there that provide additional functionality, and you can install them very simply.
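As an illustration of what that standardized format might look like: my recollection of the MATSS documentation is that each dataset is a list with an abundance table, optional covariates, and metadata, but treat the exact element names here as assumptions and check the package's data-format vignette.

```r
# A sketch of the common dataset format, based on my recollection of the
# MATSS documentation; the element names are assumptions, so verify them
# against the package's vignettes. The values are made up for illustration.
dataset <- list(
  abundance  = data.frame(species_a = c(2, 5, 3),
                          species_b = c(0, 1, 4)),
  covariates = data.frame(year      = 2001:2003,
                          precip_mm = c(310, 280, 330)),
  metadata   = list(timename = "year")  # which covariate indexes time
)
```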
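And to show how drake organizes an analysis across many targets, here is a minimal runnable sketch. The drake_plan(), make(), and readd() calls are drake's real API; the dataset loaders and model-fitting function are hypothetical stand-ins, defined trivially here so the example runs end to end.

```r
library(drake)

# Hypothetical stand-ins so the sketch runs: in real use these would load
# datasets in the common format and fit your actual statistical model.
get_dataset_a <- function() data.frame(t = 1:10, n = rpois(10, 5))
get_dataset_b <- function() data.frame(t = 1:10, n = rpois(10, 20))
fit_model     <- function(d) lm(n ~ t, data = d)

plan <- drake_plan(
  data_a  = get_dataset_a(),
  data_b  = get_dataset_b(),
  model_a = fit_model(data_a),   # the same analysis applied to each dataset
  model_b = fit_model(data_b),
  results = list(a = model_a, b = model_b)
)

make(plan)       # builds only targets that are missing or out of date
readd(results)   # retrieve a built target from drake's cache
```

The point of organizing the code this way is that drake tracks dependencies between targets, so rerunning the analysis after a change only rebuilds what that change affects.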
The idea behind building a research compendium in the same way is that if there is a compendium that defines a particular analysis, a particular research project, you should be able to install it as easily as an R package, get all the data and the code, and be able to reproduce the results. We try to make this as simple as possible, so I'm going to guide you through how we have thought about this. This is the only graph I have in the talk. I have on the x-axis the ease of use, and on the y-axis the capabilities, of different kinds of tools. At one extreme, you have something like the Keurig one-push-button coffee machine: you push the button and it makes coffee. It's really easy to use, and it does one simple task really easily. At the other extreme, you have programming languages, which can do a lot of different things, but there's a steep learning curve to learning how to use those tools and get started. I put Microsoft Excel over here; it has a little more functionality than the coffee machine, but in many cases, if you're doing a statistical analysis, you're still pushing a few keystrokes or buttons to run a very specific statistical test. And then, of course, I have to put the conference travel reimbursement as very difficult to use while performing only one very simple function. We have designed MATSS with the intention that producing these research compendia that do large-scale data analyses falls somewhere in this area of the space: with as much capability as those programming languages, but also as easy to use as possible, so that basically with one push of a button, you have accomplished as much of the analysis as we can reliably automate for you.

We do that by providing you with functionality where, with one line of code, you can create a research compendium. And in fact, it's so simple that we have actually built in scripts to automate the example of it. So I'm going to click on this to show you now. We structured the MATSS project as a software package on GitHub, and we have automated the testing of it, so that every time we make changes to the software package, we automatically generate a new example of the research compendium that gets created by default when you run that line of code. What you're seeing here is that automated example, which exists on GitHub, of what that sample research compendium looks like. It in fact is an R package that you can download from GitHub directly, and contained within it you'll see an analysis folder that has a templated report showing an example analysis applied to some of the sample datasets we have included. This is done in R Markdown, which will be the topic of a later talk in this conference, I think, but you can see it goes through the instructions of exactly how to read in the results, process the results, and then generate plots of some of the outputs for this particular example analysis. And of course, because we care a lot about assigning credit appropriately, we also make sure that all the datasets used in this analysis report automatically get added as citations to the report, along with a reference to the MATSS software package that generated this example analysis. So let me go back to my talk. Okay, so we have that automated example, and you can go and click through and see what that looks like.
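For completeness, here is what that one-line setup might look like in practice. The GitHub repository path and the compendium-creating function name are written as I recall them from the MATSS README, so treat them as assumptions and check the project's documentation before running this.

```r
# Install MATSS from GitHub and scaffold a compendium; the repository path
# and function name are as I recall them from the MATSS README, so treat
# them as assumptions and verify against the project's documentation.
# install.packages("remotes")
remotes::install_github("weecology/MATSS")

library(MATSS)
create_MATSS_compendium("./my-analysis")
# This scaffolds an R-package-style project containing an analysis/ folder
# with a templated R Markdown report you can adapt to your own question.
```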
We are actually using MATSS currently in several different kinds of projects, both internal and external collaborations in the lab. One example is to explicitly study forecasting and what influences forecast skill for populations, testing different hypotheses about whether life history characteristics, time series properties, or environmental covariates influence forecast skill, with the idea that we want to produce general guidance on how to choose a good forecasting method. A second project is what we call MATSS-LDATS, which applies a particular statistical model, latent Dirichlet allocation, together with Bayesian time series analysis, to identify and quantify different patterns of community change, and to test what kinds of patterns of change we see in whole communities across all the different ecological datasets that we have.

So, to sum up: I've talked a lot about how there are these barriers to large-scale computational research. They aren't all going to go away immediately, but we think that our project here is a prototype example of how they can be addressed, at least partially, through the use of tools and technology. Doing this work, of course, does have upfront costs, because while you're creating these tools, you are taking time away from doing the research yourself. So it's important that we get funding and support for this from funding agencies. We're thankful that the NSF and the Moore Foundation have helped to fund this work, and we hope that the results of this project end up being a force multiplier for research and help lots of other researchers out there do really cool things with these datasets. Thank you.

Great, thank you. We got several questions. And I guess I would just jump in and say thank you to the Moore Foundation for supporting csv,conf as well; it's always nice to thank our funders. The question, and I think we have time for just one, is: how can this type of work be better supported, incentivized, and scaled up? Do you have any thoughts about that?

That is a great question. I would say that is something that I am thinking about a lot more in my new position as the reproducibility librarian. I'm really interested in how we can make structural, systemic change in academia to support these kinds of efforts. I think it might have been Vicki Steeves who mentioned earlier that a lot of this kind of support work is gendered and valued less than, for example, the lone genius or the innovator. So I think we really have to be talking with funding agencies and publishers, and making sure that there are incentives, acknowledgments, and recognition of these efforts in order to really promote them.

Yeah. Okay, well, thank you very much.