I'm very pleased to welcome Professor Richard McElreath. I'm very much looking forward to his presentation, as are many colleagues, so no pressure, Richard. Richard once told me that he's in this unique position to irritate scientists from all sections of Max Planck, which is an excellent basis for any talk. So without further ado, Richard, please.

Okay, thank you for that introduction, Stepan. I thank you all for the invitation and the chance to say something hopefully entertaining about the role of software in science. I am an evolutionary anthropologist, and let me use that to situate my interest in software for you. So the problem that I study is where people came from. That's it, right? It's the modest scientific problem. It has no real important economic consequences. I don't think it's the most basic of basic research. It's essentially an existential question. And it's a tough question because we're extremely unusual animals, people. We're a very new species. Our species is only about 300,000 years old in a recognizable form. We rapidly spread out across the globe, as this slide is illustrating, starting around 200,000 years ago in Africa and spreading out rapidly to colonize other continents, only getting to the Americas about 15,000 years ago. And all of technological evolution and social dynamics and change and the rise of agriculture and the space age and Sputnik and all that stuff has happened in a blink of an eye in geological time. And those processes are not the sorts of processes that transform other animals. And so at my institute, we try to understand these things.

And one of the things that makes the problem so acute is that our species is difficult to study within the context of the way we study the rest of organic life, because we're extremely successful compared to other primates. So what I'm showing you here is a funny cartoon way of illustrating the impact that we have had as a species. This is biomass distribution. That is, if you weigh living things on the planet Earth, how much of the relative mass of all the living junk on the planet's surface is made up of different kinds of life? Now, most of the Earth is minerals, right? Lava and such. And there's this tiny surface layer of living things, and most of that is plants. And that's what you see on the left of this slide. The vast majority of all the biomass of living stuff on the Earth is plants, because all life gets its energy from the sun, and everything that's not a plant is basically mopping up what the plants produce. So to a first approximation, everything on Earth that is alive is a plant. Then animals are this tiny sliver of things; the rest is bacteria and viruses and such. And there's this tiny little sliver of animals in the gray area. I think you can see my pointer, yeah? This gray area here, which we can zoom in on. And among the animals, animals like us are also a tiny amount. That's this section in the lower right. All mammals are down here. To a first approximation, all living mammals on the planet are humans and their livestock. And if there was gonna be a species that dominated and transformed the planet and created the Max Planck Society, it would be very unlikely to be a mammal, because mammals are so rare and inconsequential. And so I tell you all this to set up how difficult the problem is. We don't even know where to start, because there's no other species that has undergone these sorts of transformations.
And the last place you'd look for a species like this to arise would be in the mammals, because mammals basically are terrible. So, having set up that level of the problem: the way we approach it in evolutionary anthropology, at my institute and other institutes in the world, is to use all kinds of evidence together. So we do paleoanthropology, we do genomics, we do archeology, we do ethnography, mainly in my department. I'm a social scientist, actually. I do STEM-like things, but actually I'm from the humanities, and I come not to praise the sciences but to bury them, as I often say; but I participate as an ethnographer in this mission to understand where our species has come from. And this is incredibly difficult because it creates a bunch of secondary problems. And those secondary problems are getting disciplines to talk to one another and how we handle the different kinds of evidence we use. It's the sociology of my field that is the day-to-day struggle of it.

A big part of what we do in my field is entirely dependent upon software. And I think this is a standard feature of all the sciences now; probably all the sciences, except maybe for some parts of social psychology, are wholly dependent upon algorithms to process data and produce inferences. And this is true obviously in genomics in very detailed ways, but it's also true even in my corner of the social sciences, because of the inferential algorithms we use; I mean, model fitting takes days. And those algorithms are these vast stacks of automatic differentiation and numerical solvers and such. And without that, we couldn't do our work. So we depend wholly upon software as an extension of our brains. I've developed software myself and have been involved in a number of open source initiatives, and I'm still involved in some of them.

And the contrast between these two problems is what I wanna bring up. There's the general problem of what I do scientifically, the problem of understanding where humans come from, how we came to dominate the planet, and the complexity of managing the evidence and continuously integrating insights to create better explanations. And then there's the process of developing scientific software and integrating it into that process. In both of these things, I think, there's this continuous integration model that has come about, where we're trying to take our work and integrate it into a common thread, call it a master branch if you will, with other people's work, and do this in a responsible way. And I've often been struck (impressed would be the wrong word, because it sounds positive), I've been struck by how professional this is on the software side. When I work with software engineers, and the open source development community in particular, people know how to do this. We can work together in groups, even if we have never met one another. We have software platforms to help us do those things. And then with my scientific colleagues, when I try to do this, it's a mess. It's incredibly difficult. We can't open one another's files. Nobody knows how to do version control, et cetera, et cetera. And so this is what I wanna talk about today: essentially, what all of this, this common, distributed, professional infrastructure that is not universal but is extremely common and normative in the open source scientific software development community, can do for science as a whole, which mainly lacks these norms of interoperability and testing and version control. Now, you can't directly transfer these things.
So I don't wanna say that we can just take it and apply it to scientific dataset management in a direct way. But in an analogous way, it has a lot to offer. And in my department, we are managing our scientific databases in a fashion analogous to the way we manage a code base. So let me try to expand on this, and hopefully I can make it interesting and not too boring.

So, examples. Throughout the talk, I'm gonna try to make a broad point and then give you one example of each thing. Here's an open source project that my department has invested a lot in, and we continue to invest in it, and it's become absolutely essential to the inferential projects in my department. It is the Stan Math Library. Stan is not an acronym; even though this is a software project, it's not an acronym. Stan is for Stanislaw Ulam, who was a Polish mathematician and one of the inventors of the Markov chain Monte Carlo method. And the Stan Math Library does automatic differentiation. So you program generative models, inferential models, into it, and it does all of the gradient calculations for you automatically from the program code. And this is how deep learning works, in a sense; it uses these auto-diff algorithms to do gradient descent and so on. So we use this in all of our inferential projects, and it's a big distributed project. It's typical: there are lots of programmers, some of them are mathematicians, some of them are serious software engineers, somebody's doing the templating, and there's lots of specialized expertise that goes into this. And I'm at the end as an end user: I write probabilistic programs and try to break it, and then I put in issue reports and so on. And this is an extremely active project (well, not compared to something like the Linux kernel), but extremely active in the sense that every week there are important things getting reported and fixed and new features being added and so on. Many, many contributors, many of whom have never met one another. But it all works. We can continuously integrate the patches and so on. And there's nothing remarkable about this project in that regard. Many of you do this sort of stuff yourselves and you know: this is standard professional behavior for this kind of project. And it's why it works and it's trustworthy and I trust it to produce inferences.

One thing I wanna highlight, and I've pulled it out here: I checked the latest repository. There are 3.6 megabytes of library code that gets compiled down and used, that we can then build other things on top of. And inside the project there's more than double that in just test code: unit tests that exist so we know we're not breaking stuff when we make edits. And those of you who do software development, you know what this is, and it happens to you too: you end up with more testing code than actual deployed code. And this is part of the norms, the responsible professional conduct, of a project of this kind. And you can guess what I'm gonna say, which is that in the sciences there's really nothing analogous. There are labs that behave like this, but the normative default is almost the opposite in the sciences.

So let me make a movie reference here. In the movies, this is what a programmer is, right? In the movies, programmers are hackers. They're people who wear dark hoodies, and they don't have a lot of lights in their house, and they're doing some secret sort of thing on the computer which is incredibly powerful.
So they work alone, they use secret methods, things aren't documented, and they produce amazing results as a consequence. Real programming, of course, is nothing like this, right? It's teams of people struggling to understand one another's code and using Git and other things to make things work and make projects come off. There is programming that's unprofessional too, of course, I don't wanna say that's not true. There are hackers, and all programmers are hackers sometimes. But what I wanna emphasize is that most scientists are like hackers. It's not that they necessarily work alone, but they work in small groups, they don't document their methods. They don't have a clear way of doing version control other than copying files and renaming them. Even in databases it's like that. There's almost no testing to show that the pipeline works before it's deployed, and so on. Scientists are also represented in movies, and some of you will know the screenshot here at the bottom: this is an old movie about time travel where there's a mad scientist, the sort of archetypal mad scientist. Scientists don't usually look like this, but they do often work like this, with crazy laboratories. And if you ask them for the details of their procedures, it's quite embarrassing to pull things up, or they just can't remember how they produced them.

And let me try to take that beyond just random insults to science for a second. I am a scientist and I love science; that's why I still do this. But there's a serious issue here, and I'm not the only one, obviously, who's noticing it. Here's a quote that I quite like, from 2015, from the editor at the time of the Lancet, the most prestigious medical journal in the world, Richard Horton. And the context of this quote is that the UK had a sort of emergency closed-door council meeting among the leaders of the big scientific organizations in the UK, because they felt that there is a reliability crisis in the sciences, especially in the medical sciences, and they needed to do something about it. And so Horton couldn't report exactly what was said in that meeting, but afterwards he said this in an editorial he wrote in the Lancet. And if you'll indulge me, I'd just like to read it to you with a tiny bit of theatrical drama. "The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness." That's the editor in chief of the world's most prestigious medical journal, and he's not the only one, right? This was the consensus of that closed-door meeting among leaders of scientific granting organizations and so on. This is a problem. And there's a lot of evidence for it, and we know some of the causes.

Here's a cause that's sort of fun to play with. I made this joke about hackers. Well, in the sciences now there's this term called p-hacking, and I don't know how many of you have heard of it. It's a nice term, because it gives a name to something that has been going on for decades, of course, and desperately needed a name.
So what I'm showing you on this slide is a screenshot from a website which lets you actually do this p-hacking thing, which is something that scientists are not supposed to do, but this web app lets you play around and figure out how statistical inference can be corrupted by multiple testing, essentially. If we try a bunch of different things with the data and only publish the hits, then that produces false positives. It produces false results. And this is the p-hacking crisis. So some of the problems with the medical literature, and the other sciences as well (psychology has been criticized quite strongly here), are due to these sorts of practices. The ways that scientists conduct themselves: they have inferential methods which promote their careers but don't necessarily create reliable knowledge. And this has been talked about a lot. This is not what I really wanna talk about today, to be honest. But to give you an idea of how important this is: in the year 2000, in the United States, the medical community started requiring clinical studies of new medicines that were going to be brought to market to pre-register their outcomes. And that means: what is this medicine supposed to help a person with? Prior to 2000, they didn't have to say, so they could just take a chemical compound and test it on a bunch of people and measure anything they wanted about those people and see how it helped them. And so prior to 2000, there were lots of drugs being discovered for just random things. These pluses on this graph are improvements in some condition that had not been pre-registered; they didn't intend the drug to be used that way. The problem is that it turned out lots of drugs were going to market through this procedure that actually didn't work, because these are flukes. They are things that were found in the sample: if you measure a hundred things about people, and that's what they were doing, you will find one thing that improves just by chance, and then it goes to market. And it turns out to be very expensive for pharmaceutical companies to do this, because, believe it or not, they can't actually sell drugs that don't work. So in 2000, they got together and they lobbied for this change, and you won't be shocked to learn that drugs miraculously stopped working after the year 2000 in clinical trials. And now it's extremely rare (just these two hits, for example) that clinical trials produce drugs that are an improvement over the best alternative treatment. So this has been a major scientific reform, and this is the p-hacking problem, which is, again, not mainly what I wanna talk about with the remainder of my time, but it's important and it's important to put it up front.

So things like p-hacking are in this category of issues with scientific conduct: the causes of irreproducible (that's quite a word) research. And what you're seeing here is a graphic from a 2016 survey that Nature, the journal Nature, did with a broad range of scientists, asking them subjectively what they think the causes of irreproducible research are. And the top ones here are these things that I'm going to label, playfully, greed. They're driven by professional incentives, and they're things like p-hacking. You have to publish or you go out of business, right? So selective reporting of results, that's p-hacking; the pressure to publish is an incentive for that; and low statistical power is a way to farm lots of selective results.
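To make that multiple-testing point concrete, here's a minimal simulation sketch in Python. This is my own illustration, not the web app on the slide, and all of the numbers (studies, outcomes, group sizes) are invented. Every outcome is pure noise, yet a researcher who measures ten outcomes and reports whichever one crosses p < 0.05 will "find" an effect in a large fraction of studies:

```python
# Minimal sketch of p-hacking by multiple testing: the treatment does nothing,
# but measuring many outcomes and reporting any "hit" inflates false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group, n_outcomes, alpha = 1000, 30, 10, 0.05

hits_preregistered = 0  # only the single pre-registered outcome is tested
hits_hacked = 0         # all outcomes are tested, any hit gets reported

for _ in range(n_studies):
    # Null is true for every outcome: both groups drawn from the same distribution.
    treated = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    pvals = np.array([stats.ttest_ind(t, c).pvalue for t, c in zip(treated, control)])
    hits_preregistered += pvals[0] < alpha
    hits_hacked += (pvals < alpha).any()

print(f"false-positive rate, one pre-registered outcome: {hits_preregistered / n_studies:.2f}")
print(f"false-positive rate, any of {n_outcomes} outcomes reported: {hits_hacked / n_studies:.2f}")
# Expect roughly 0.05 for the first line and around 0.40 for the second.
```

The point of the exercise is just that the inflation is mechanical: test enough outcomes and the hits arrive on their own.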
There's this other category, which I don't think receives as much attention as it should, and that's the topic of the rest of the talk for the most part. And these are the things that I would, again playfully, label sloth, if we're going with the biblical set of sins. And these are things like the original lab not checking the work enough to make sure it was reliable, insufficient oversight or monitoring of the results (so it's sloppy work), and lack of documentation. No one knows how the result was calculated. The code was not saved. Some postdoc did it and they've gone off to another job. Poor design, and you can't even get the original data, so you can't reconstruct how they got their result. So this is the sort of thing I say which is embarrassing: if you ask many scientists to demonstrate the evidence for their papers, they can't, because they no longer have the data and they do not have the code. And this is a major crisis, I think. It's an empirical crisis because it produces unreliable results, and it's a moral crisis because it's deeply unprofessional and it's a violation of the public's trust in us as scientists. And this is before, I should note, we get to things like fraud. Everything here is just sloppiness, right? We haven't even gotten to the lower circles of Dante's Inferno, with fraud and treason.

Okay, examples, I said. I just made a very broad claim; I wanna give you some examples that are lighter and hopefully somewhat entertaining as well. So here's a result, a drama that starts in the year 2010. So think back to 2010. In 2008, there was this small economic problem, you may recall, that started, and some things happened with the markets. And in 2010, all economists were still talking about how to get the world economy to recover. And so there was this very important paper in that conversation from two Harvard economists, Reinhart and Rogoff, in 2010. It's called Growth in a Time of Debt. And they had an empirical analysis of the relationship between public debt and the growth of an economy. And they made this argument for what is called austerity, which means governments not spending money on recovering the economy, and a lot of European economies followed the advice of this paper. This paper was influential both in Europe and the United States. In the United States it was waved, literally, physically, on the floor of Congress as an argument for not spending taxpayer money on a stimulus package. And as you know, as Europeans here, it affected European monetary policy as well. Now, whether this is the right argument or the wrong argument I can't take a stand on, but the result in the paper showing a negative relationship between economic growth and the amount of public debt is a numerical error. And it is a numerical error that was discovered by a graduate student, pictured here. This is Thomas Herndon, who was a graduate student at the time (now he's a professor of economics), who tried to reconstruct the numerical results in Reinhart and Rogoff's original paper and was unable to. The data weren't available. The code wasn't available. He just had the result in the paper. But these are public data, presumably: you can download the historical GDP data for all the countries, you can get public debt statistics, and you can run the model yourself. And he kept doing that and couldn't get their result. So he eventually got the authors to respond to him, and they sent him their spreadsheet, which is very good actually. It's great behavior on their part.
They deserve a lot of credit for that transparency. And it turns out the result is entirely an error inside Microsoft Excel. Now, if you're like me, I'd say I have very strong opinions about Excel. I think it should never be used in science. Basically, I'm a fairly hardcore anti-Excel person. And the reason is because it tolerates all kinds of error. In fact, it actively generates error. In this case, the error is one where the cell formula did not include all the necessary cells, and they just omitted data from some countries and never noticed. And Herndon noticed this because he did the forensics on their spreadsheet. And this is actually their spreadsheet, and this is the cell formula that should extend all the way down and include other countries, and doesn't. And their result evaporates when that's gone. So there you go: the lack of a stimulus in some parts of the Eurozone is due to the omission of five cells in a Microsoft Excel spreadsheet. Okay, sad. What's my point here? The point is that, no, I shouldn't show you that, it's too interesting. Okay, so the point is that this is the kind of thing where basic testing of the inferential pipeline finds these problems before the paper arrives. Or the basic professional conduct of sharing the inference path. These are the kinds of things that, again, in open-source software development, we have ways of testing, of making sure the inferential pipeline works before we get there. But in the sciences, you can have an influential paper where none of the evidence of how the inferences were produced is shared at all. And sometimes when you ask for it, people are insulted to be asked. So that's a difference in norms that I think has to be addressed.

Here's another fun example with Excel. Again, I said I hate Excel; this is just me exercising my grievances, but this is fairly recent, just from August this year. The human gene nomenclature committee (there is an international committee which names genes in the human genome, which is good, because we like to standardize on names) has renamed a bunch of genes, however, because these are names that Microsoft Excel will automatically convert to a date. If you've used Excel much, you know that Excel thinks everything is a date, and it will just pepper your data with dates. And this is one of the reasons that it is forbidden in my department to store and manage data in Microsoft Excel. Something like a fifth of all published human genome papers have errors in them that result from Microsoft Excel automatically converting gene names to dates. You can imagine the chaos that that creates in a published literature. And it turned out that Microsoft was totally unresponsive to this problem, which won't surprise you. And so the human gene nomenclature committee decided to rename all the genes. This is a lot of work, because a lot of things have to be changed, but this is what happens, right? It comes from the lack of a professional set of norms about how to use scientific software, how to test that it's doing things correctly. This is a basic unit-testing sort of problem. But it's extremely common in many fields.
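Both of those Excel stories come down to the same thing: nobody ran a cheap, automatic check on the data before conclusions were drawn from it. As a minimal sketch of what I mean by a basic unit-testing sort of problem (the gene symbols, country names, and numbers below are invented for illustration), checks like these would flag both failure modes:

```python
# Minimal sketch: two cheap checks that would flag the Excel failure modes above.
import re

# 1. Gene symbols such as SEPT2 or MARCH1 are exactly the strings Excel rewrites as dates.
symbols = ["TP53", "BRCA1", "2-Mar", "SEPT2"]  # "2-Mar" is a symbol already mangled into a date
date_like = re.compile(r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$", re.I)
mangled = [s for s in symbols if date_like.match(s)]
if mangled:
    print("gene symbols that look like Excel dates:", mangled)

# 2. The Reinhart-Rogoff problem: a summary that silently leaves out rows.
expected_countries = {"Australia", "Austria", "Belgium", "Canada", "Denmark"}
growth = {"Australia": 3.1, "Austria": 2.5, "Belgium": 2.9}  # invented numbers, two countries missing
missing = expected_countries - growth.keys()
if missing:
    print("average would omit countries:", sorted(missing))
else:
    print("mean growth:", sum(growth.values()) / len(growth))
```

Nothing here is clever; the point is only that the checks are written down and run every time the data change, instead of living in somebody's head.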
Another example, just very quickly, I'm conscious of the time. There's a lot of really consequential medical research, for example on cancer, where the industry itself, the pharmaceutical companies, is upset with the scientific community, because the scientific community is producing results which cannot be repeated even once in somebody else's lab. This is from 2012, so it's an old problem in a sense, right? Bayer HealthCare, here in Germany, tried to repeat a bunch of very highly cited preclinical cancer results and found that they could repeat only about a quarter of them. And this is extremely expensive, as you can imagine, just to try a study like this, but it's even more expensive to try to go to market with something that doesn't work. And the cost in human life is, of course, what we should be talking about. That's the big cost: it's when people die because you've tried to treat them with things that never worked in the first place. So this is a big problem, and Bayer's diagnosis was that this was sloppiness. There was tremendous sloppiness in the original labs. The original labs hadn't even kept the data. They didn't know what happened, because the procedures for getting the result only existed in the head of a postdoc. We can do better than this.

Okay, good news. This is not a new problem. This is not some symptom of modernity or postmodernity or the industrialization of science. This is a problem that's always been around. Science is a tremendous thing; this is the line that I use in the abstract. I believe that science is an amazing accomplishment, and the cumulative evolution of knowledge about how nature works is a tremendous achievement. But academia is a mess. And it's a testimony to the power of science that it can grow despite academia, not because of it. And so here's a book that I quite like. It's a history of science book, a history of the periodic table of the elements, which, as you know, goes back quite deep. We look at the periodic table now and we think, oh, that's science; that's eternal knowledge that would be true anywhere in the universe. But assembling it was a mess. All the same issues, sloppy results and irreproducible results, existed in the hundreds of years of building the periodic table of elements. So the authors here, who are chemists by the way, two chemists and a historian of science, summarize the book this way: there are many more elemental "discoveries", in quotation marks, later shown to be false than there are entries in the present table. So academia has always been a bit of a sloppy mess, and this is a chronic problem: a lack of professional conduct and the hobbyist, mad-scientist nature by which things go on. And this is what I'm arguing we should do something about.

Okay, I do occasionally publish papers on this topic, just to say that I'm not purely pontificating to my colleagues here. I have stood up and staked some of my reputation on these critiques, and I'm happy to do that. I do not believe I'm having much of an impact, though, just to undercut myself a bit. One of my more cited papers in recent years is a paper with Paul Smaldino, who's at UC Merced, on the natural selection of bad science. The paper is mainly about how incentives distort scientific conduct and cause irreproducible results, but I've come to think these basic responsible-conduct-of-analysis problems are much bigger than that. Or, one way to put it: here it is in meme format, because everything on the internet works better as a meme. This is me discussing science reform on the left, and that's science on the right, having a really good time.

So what's the basic problem? I think the basic problem, how we got to this point, is this: if you wanted to learn to be a carpenter, you'd study with a carpenter.
If you wanna learn to be a professor, you study with a professor, but what you learn is how to be a professor, right? And how to be a professor is how to get funding, how to get published, how to get cited. It's not how to produce eternally reliable knowledge, right? That's hopefully part of the game, but the essential skills for being a successful professor are the things I've listed on this slide. And that's a basic material problem with academia. And as a result, the actual research skills, the responsible normative conduct of research, and documentation, and all those things that in open source software development are part of the professional norms of how we do things: those things, if they exist, are informally transmitted. They're not the essential things that get you a job. Hiring committees don't ask you how you manage your database.

Okay, so, the things that aren't taught: we get back to talking about software norms now. Scientists usually are not taught, as part of a PhD curriculum, how to organize their data. They're taught almost nothing about that. Graduate students open up a spreadsheet without any advice from somebody else. Maybe there's a senior postdoc or a junior scientist who helps them, but there's no curriculum. And then they start keying in data. And chaos happens, because we're all humans, and these are difficult things, and we need professional advice on how to do this. There's the organization of data, and then there's its curation: that is, how do you continuously integrate new data and maintain its integrity and intelligibility in the long run, so that science can be cumulative? And that's a separate problem, analogous to problems in software development which have been addressed much more vigorously there. Then there's testing of data and procedures: scientific analyses and pipelines can be tested, and some labs do this. I know a lot of labs that do, but a typical scientist does not. They have an analysis, they click some keys, it's not documented, it's reported inaccurately in the papers, and so on. The biggest, top-level problem is how to manage distributed contributions. The continuous integration of different results, sourced from different datasets, stamped at different times, into a common body of knowledge is a huge problem that I will have nothing clever to say about today, but I think this is our top-line problem in reforming the sociology of science.

Okay, so I wanna give you an example, just to say that I'm not just making up gossip about the sciences. Here's a project that a PhD student in my department has undertaken. Riana Minocher has run this audit study, basically, of the quantitative evolutionary anthropology literature, to see how many published results she could reproduce. And Riana has not done this alone. Of course, she has a vast team, which I represent here, that is helping her with various parts of this, but the intellectual design of it is hers and she deserves most of the credit. So what this is: Riana wanted to go into the sausage factory, pictured here, of quantitative evolutionary anthropology. She randomly sampled 560 empirical quantitative studies and attempted to reproduce them. That is, can she get the data? Can she get the other materials, including any code? And then can she take those materials, when they're available, and reproduce the published numbers? And all of this is up on GitHub, and you can look at all the details of it there. So here's a quick version, in one slide, of something that consumed two years of Riana's life.
First there's acquiring materials. For the 560 studies, in 11% of cases you could just get the data and code online. It was just a click, and that was very satisfying. In the rest of the cases, it wasn't available; there was no data and code. And so Riana and her team identified a contact person and sent a request; that's for the remaining 84% of the papers. In 56% of those cases she received a reply. I think that's actually pretty good. I had bet on a much lower number; maybe I'm too cynical. And then in about 20% of the cases, the correspondence with those people eventually resulted in data and code being received. So overall, we could get data and materials for 30% of the original sample. I don't think that's great, right? The other side of this is: given that you recover the data and materials, can you then understand the analysis pipeline well enough to reproduce the results? This is better, actually. We found that, conditional on getting materials, often the documentation is quite good. And in about 80% of those cases we could understand the analysis, repeat the analysis, and get the same numbers as in the original study. That could be better; there's some embarrassment and regret in reporting that 80%, but it's way better than 30%. The problem is that people aren't even documenting and saving the procedure by which the result is produced. They're only saving the result.

Okay, so just to try and make this less of a dry presentation for you, here are some anonymized quotes that came from correspondence with authors when Riana and her team contacted them. And I should say there's a very careful protocol for these conversations; it was all preregistered and set up. Riana and her team had scripts, basically, so that it was all very collegial, giving the same responses to the same kinds of replies. We're social scientists in my department, so we do this as professionally as possible. But we wanted to do a qualitative analysis of the responses, and that's why we have these things, but they're anonymized; we won't reveal anyone's identity here. So here are some fun ones. "There is no code, scripting or paper data still in existence." This was for a study that had been cited hundreds of times, by the way. "This does not mean that the results are in any way invalid." "I do not agree with the focus of your project." Well, thank you very much. "Looking at studies done even just a few years ago, for example, published in 2015 or earlier, is completely counterproductive." That's done, right? I mean, we're moving on. We've got new papers to publish now, right? And my favorite one: "I don't really see the point here." Right? We didn't get any materials from this person. There were some good responses, though. By and large, when authors didn't have materials to give us, they apologized. They acknowledged that they were violating norms and they felt bad about it. So here are some good ones, where people were basically coming clean to us and confessing: my God, that was 17 years ago, I have no idea where that data is, right? Again, this particular study, I know, has been cited hundreds of times and has spawned many other studies that follow up on it. And the evidence that it makes any sense at all is gone. Should we build on things like that? I don't know. Here's my favorite one, because it's about the procedures and the sloppiness point that I keep making.
"Such studies often use expertise from several people and make multiple intermediate versions of the data sets." Let me pause in reading this quote right now to say: think about software development. That is what software development is like. You get expertise from several people, you make multiple intermediate versions of the code base, but we have tools for integrating and managing that process, because we know it happens every time. In the sciences, there is no middle sociological structure that binds us and lets us bring our contributions back together in a responsible way. So what you get instead is, resuming the quote, "this is often done without really knowing what will end up in a paper and what will not, or even what the paper will be about." Okay, well, some software projects are like that, to be perfectly honest, but lots of scientific publications, even high-profile ones, work this way because they're p-hacking, right? And then none of that procedure is saved.

Our study is not the only study of this kind. We looked at evolutionary anthropology; here's a nice paper that just came out recently from Culina et al. on the literature in ecology and the difficulty of recovering analysis code in ecology. Ecology does a lot of big data analysis these days; it's becoming a big data field. However, in their random sample of 346 articles from the top journals, they were only able to even potentially reproduce results in 20% of the cases, and this is just not acceptable for what is mostly publicly funded research, in my opinion.

So let me get to the punchline; I've been setting it up this long. Software engineering, especially open source software engineering, and the sysadmin culture that goes with being responsible for infrastructure, has a lot to contribute to science in the present day, because it's become essential to the way the sciences work; most sciences cannot work without IT, crudely speaking. But it has sociological things to contribute as well, because it has had to set professional standards that allow people to work in distributed teams, often internationally, and continuously integrate their contributions, test code before it's deployed, things like that. These professional standards can be translated. They can't be moved directly, but they can be translated to conduct within the sciences, and the sciences desperately need this. Things like the translation of version control procedures, instead of copying data sets and modifying them. I've seen it with many of my colleagues. I first got into being concerned about this because I started my career by doing meta-analyses, basically being on teams where we did common protocols in different places and then tried to pull the data together, getting data from other people. I learned very quickly that people couldn't even figure out which version of the data set they used in the published paper, because they had a folder that had multiple Excel spreadsheets in it, and they were named like "data set one", "data set two" and those sorts of things. Imagine if code projects were managed that way. I know some are, I've seen them, but by and large, in my experience with the open source community, it's much better. Testing: scientific analyses can be tested, and I'll have some examples of this when I get to it, so I just beg your indulgence to wait for a second. Documentation, obviously, is saying how you got the results and having it in a form that's like code.
Having code is a form of documentation. And then there's the continuous integration problem, which I don't know how to solve or translate to the sciences in general, but we could do a lot better on this. So this is what I wanna say: software engineers have a lot to contribute to science in a basic normative sense. Some quick examples to give you an idea.

So, this is what I mean when we talk about version control in the sciences. It's happened to me my whole career: I ask people how they performed an analysis and they say, let me see if I can find it. They look in their folder and they find a bunch of stuff, as pictured on the right. A bunch of scripts, if they have scripts at all; they might just be clicking in a stats program to produce the analysis. If they have scripts, they have multiple duplicated copies, and one of those versions produced the analysis, but they don't know which. So they'll just send me the whole folder sometimes, and then it's impossible for me to figure out what happened. Version control, obviously, is a topic that scientists have started to talk about in these cases. So there's a whole nonprofit that's focused on this, actually: the Software Carpentry and Data Carpentry workshops, which are aimed at teaching scientists basic responsible conduct with data. And every year we now run one of these Data Carpentry workshops at my institute, to teach researchers at all levels just basic responsible version control and planning and error testing for the curation of their data sets, things that they are not taught anywhere in their scientific career up to this point. So, again, the first workshops are teaching you basic stuff like how to organize your data in a spreadsheet. And you might say, well, what could there be to teach there? There are lots of crazy things people do if they've had no instruction. For example, they'll make spreadsheets with 2000 columns, and then the analysis requires a bunch of reformatting of the data. But then there's cleaning, and there's error control, and there's a whole curriculum to move people away from spreadsheets, actually; but you start them with spreadsheets, you meet them where they are, and then you educate them into responsible conduct. And all of this is patterned after how open source software engineers are taught how to use Git and so on.

Okay, we also have a whole group in my department. There's nothing special about my department; there are lots of other research units in the world that also have groups like this. I've just copied someone else's. We call it the data provenance research group, and it is focused on the continuous integration of all the data sources that the research projects in my department use. And we have this distinction between the primary sources, the green nodes here, and the testing and the auditing and the coding processes that go into producing them, and the primary sources are integrated into a private database. We're social scientists in my department, and I study people, so the master database has to remain private to protect identities. And we continuously integrate these things using a version control system. And in the last three years, this group has processed about 40,000 interviews in this procedure, using continuous integration. And there are also tests that are run on this; I'll get to the tests in a second. The tests check for things like impossible family structures. Again, I'm a social scientist: basic human biology means that children have to be younger than their parents.
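Here is a minimal sketch of what one of those checks looks like. The record format and identifiers are invented for illustration; the real tests run automatically every time new interview data are integrated:

```python
# Minimal sketch of a consistency test on interview records: children must be
# younger than their listed mother. Records and IDs are invented for illustration.
records = [
    {"id": "p1", "birth_year": 1960, "mother_id": None},
    {"id": "p2", "birth_year": 1985, "mother_id": "p1"},
    {"id": "p3", "birth_year": 1950, "mother_id": "p1"},  # impossible: born before the mother
]

def impossible_family_structures(records):
    """Return records whose birth year is not later than their mother's."""
    birth_year = {r["id"]: r["birth_year"] for r in records}
    return [r for r in records
            if r["mother_id"] is not None
            and r["birth_year"] <= birth_year[r["mother_id"]]]

for bad in impossible_family_structures(records):
    print("flag for correction:", bad["id"], "is not younger than their mother")
```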
So if in your database you have a child who's older than a parent, you know that's an error. So we scan: we run unit tests of this sort on the database, and we find lots of mistakes from the original data collection, and we fix them. The whole system is also set up so that we can publish public versions when we publish papers. The data used in a paper is an extraction from the master database, but it's timestamped. And so when the master branch gets updated later, we can re-export the data relevant to that paper, so that if errors have been corrected in the meantime, people returning to that paper we published before get a fixed version of the dataset. We also save the queries. And so if people ask us how we did stuff, we can tell them exactly, because we know the query and we know when it was run and so on. And again, where do we get this? This all comes from what I learned doing open-source software development. This is where we get the idea. It's nothing remarkable at all. It's just that, well, at least in anthropology, this is hardly ever done.

Okay, I'm gonna wrap up here, I'm conscious of the time, with just a couple more points to make. Testing: I talk about unit testing a lot. Those of you who do software development, you know the importance of tests. If you're talking to a colleague and they're developing a project and you ask them about their unit tests and they're like, what are those? You're gonna get worried about their code, right? And so this is an issue with scientific analyses. I'm very concerned with how analytical pipelines work in the sciences and the lack of planning and testing that goes into them. The good news here, and there is really good news, is that there's been a lot of work, especially in the last 20 years, on the ability to prove whether a statistical analysis can, in principle, reveal a causal effect of interest. And Judea Pearl, whose book is pictured on this slide, is a key person who's contributed the most to this. But there's a whole community that works on this now, and it is basically the rigorous logical proofing of an analysis: given a set of assumptions, could you even, in principle, learn what you think you're gonna learn from this analysis? And it turns out there are lots of published analyses which cannot, even in principle, do what they claim. It's not an empirical argument at that point; it's just a basic issue. So this is what I call testing an analysis: having the unit tests. And this is an algorithmic thing, in some sense; it can be made into an algorithm.

Let me give you an example without going into any algorithmic details. This is, again, an example from my department, but there are lots of units doing these sorts of things. So here's a paper which was published in 2019 on a very hot political topic in the United States. Some of you may have heard that there's a problem with policing in the United States. The American police shoot a lot of people. And there's no argument about that; the data are clear. But there may be racial bias in that use of police deadly force, and there's a research literature on this in criminology. This is a paper from 2019 which rightly pointed out that a lot of analyses don't control for rates of crime in different demographic groups in the United States. And so they applied a correction for rates of crime and found that there's no racial bias in their analysis. The problem with this paper is not its intent; that is great, and I don't wanna censure the authors in any sense.
The authors aren't up to anything nefarious here. They made a valid point, but they chose a method of statistical correction which can't even work in principle. And so one of the scientists in my department, Dr. Cody Ross, noticed this; he's the first author on the paper that was eventually written in response. And basically, in this much algebra, you can prove that the statistical correction in that previous paper doesn't do what it claims. It actually makes things worse; it actually magnifies the bias. This is what I mean by unit testing an analysis. We don't even need data to know whether an analysis can in principle work, but there's no set of norms or procedures in most of the sciences for asking that question: can the statistical analysis be proven, in the cold hard light of mathematics, to function? We don't need to invent any new math to do this. The math exists. It's just that there's no normative expectation that people use it. Okay, thank you for indulging that. That's my rant; I'm very animated about these things. And with issues like that, of course, there are health implications, right?

Okay, so what am I saying? I'm saying translating unit testing to scientific studies goes something like this. First, it's possible to express a theory as a probabilistic program; any generative causal model can be written as a probabilistic program. Then you can use an algorithm to prove whether the analysis can work or not to identify some causal effect of interest; Judea Pearl has derived a bunch of theorems which make this possible, and his students and collaborators have written computer software which uses those algorithms. You can then create synthetic data sets and test the pipeline, in classic open-source software fashion, since you can test whether the pipeline works. And then finally, after all that, now you're ready and we trust your pipeline: it's time to put real data in it. And of course, it's important that all of this history be open and available in a public repository, so that people trust the analysis.

Okay, so what am I saying here? I'm gonna come to the end. The big problem that is shared in common between the endeavor of science and the endeavor of developing open source software to support science is integrating work from different experts, and doing it in a responsible way, and doing it transparently in public, so that people who come after us can have some trust in what we've done and in our work. And also, when mistakes are discovered, and mistakes are always discovered, they can go back and find the source of the mistake and correct it, and learn exactly what the consequences are downstream for our knowledge and experience. I feel like in the open source software community, when I'm in that community working on software with people, I feel like I'm working with professionals. Of course, everybody has their foibles and some individuals are not as professional as others, but there's a set of norms which bind us and a set of tools we're expected to understand and use which make us more responsible. We still make mistakes, but we're trying. When I work with scientists and I tell them about things like managing databases and stuff, they blink and they say, oh, that sounds great, I've never heard of that. It's two different worlds, and I would increasingly like to see software engineers and other similar people get involved in the conversations and in helping the sciences be better. Okay, thank you for your indulgence. I hope that was a little bit interesting, and I'm curious to hear what you think.