OK, so we've got another lightning talk from Philip Durbin on advancing science with Dataverse. Thank you everyone for coming. Again, my name is Philip Durbin. I'm a software developer at Harvard University. This is a picture of our campus. We're across the river from Boston in the United States. And I'm here to tell you about Dataverse. So Dataverse is a community of data enthusiasts, specifically around research data. That means that we are scientists and researchers, and often we come from the academic library world: librarians and data curators, data scientists, software developers like myself. These are some pictures from our annual gathering in Cambridge, Massachusetts. We have our sixth annual Dataverse community meeting this June, and everyone is welcome to come. We always play what we call soccer. And more importantly, for FOSDEM, Dataverse is open source software. We're Apache licensed. There are 52 installations of Dataverse around the world across six continents. It has been translated into 10 languages, and there's an opportunity to contribute there for sure. Here are some stats from GitHub on our repository: over 100 contributors. We're written in Java, but I'd like to emphasize that we have APIs and client libraries for a variety of languages such as JavaScript, Python, and R. So if you would like to contribute to Dataverse, there are lots of ways to get involved. And Dataverse, again, is for research data. We would say that it's open source research data repository software. But what does that mean, research data? Let me give you an example. I saw this on Twitter a few weeks ago and asked this scientist if I could put him in my slides. His name is Arvin P. Rabakumar. He's working on climate change. And you can see here that he is tweeting his heart out. He is preparing a manuscript, a paper for publication in a journal. He is explaining his argument, he is making data visualizations, and all of this.
And then he asks #AcademicTwitter: if I have primary data, what should I do with it? He's saying that in the past he has always put it under what's called supplemental information in the journal article. But one of the reviewers of his paper is saying you should get a DOI for your data. Now, a DOI is a digital object identifier. It's a whole thing. I was just in Lisbon this week for a conference called PIDapalooza, PID being a persistent identifier. In the academic world, this is how we cite each other's work. This is how we acknowledge each other. We build up a graph showing that this work is derived from that work. We're all standing on the shoulders of giants. And so with Dataverse, what we're trying to do is elevate the data set to be a first-class research object. So instead of just your papers, think about a citation for your data. In the end, I'm happy to say that the scientist decided to put his data into Harvard Dataverse, and this is what that looks like. Harvard Dataverse, and I have these pamphlets here, is a little unique among the 50 installations of Dataverse in that we accept research data from around the world and we'll host it for free, up to one terabyte. So this is just an invitation to the crowd: if you yourself have research data and you don't know where to put it, or you know someone who does, please send them to Harvard Dataverse and we'd be happy to host the data for them. Another thing I want to point out about this data set is that his raw data, his primary data, is only about half a megabyte in size, and yet you can see how rich the data is. He's exploring the data with data visualization. He obviously has a lot to say on Twitter about his data. We might call this the long tail of science. If you work in, say, biochemistry, you might have a natural place to put your data. Maybe you put it in the Protein Data Bank, for example. But for a lot of science, there is no place for the data.
So that's part of the need that Harvard Dataverse and the Dataverse project as a whole are trying to meet: we want to welcome all scientists from all disciplines to publish their data. I want to talk a little bit about cultural change and try to explain that people like the scientist we saw are very similar to open source developers. You can see that we like to share code, and we are seeing that researchers are willing to share data. But this is a relatively new thing. This pyramid is a diagram that's based on a tweet storm by Brian Nosek from the Center for Open Science. What it means to me is that first we had to build the ability to even share data at all. That's at the bottom. And then projects like Dataverse have come along to hopefully improve the user experience for sharing data. I stopped by the Open Source Design table this morning, and efforts like that are great. Let's not just have open source software. Let's make the software usable. Let's make it painless to share data. And then as we go up the pyramid, what we're seeing now is some cultural change. So again, the reviewer of the paper is the one who said, hey, you should make your data set a first-class citable scholarly object. So that's great. That's exactly what we've been trying to do for years: get to where it becomes good scientific practice to share your data with the world. And increasingly, I will say, funding these days often requires you to share your data. So university libraries and other places have a reason to install research data repository software like Dataverse, so that they can have a place for their community to share their data. Also, I'll mention that on the journal side, the places that are publishing these academic papers are now giving incentives to researchers to share their data. They're also trying to move research towards more openness and more sharing of data. All right.
Now I'd like to step you quickly through this concept that we have in my world of what we call the FAIR data principles. FAIR is an acronym that stands for Findable, Accessible, Interoperable, and Reusable. Let's start with Findable. Part of the idea with putting data in a repository like Dataverse is that other scientists can find your work and reuse your work. So when you publish a data set in Dataverse, we send metadata, that's data about data, across the wire to a non-profit called DataCite, and this is an aggregator of all sorts of scientific data. A new player on the scene is Google. They have just brought out of beta, last week or the week before, a tool called Google Dataset Search. We've been working closely with them and putting all of the right technology in place so they can easily crawl installations of Dataverse, find the title, the author, and the description, and make them all available in their new tool. And this third one is from a project called SHARE. It's another effort within academia to make data more findable. In this case, they use the Dataverse Search API to pull in the latest records all the time. These are a couple of screenshots of what these tools might look like when you're searching for data. The thing I like about these tools is that they expose the number of citations to the data. And again, citations are sort of the currency of the academic world. So here's a data set with 13 citations. That means that 13 papers are making use of that data, reusing that data. So we're really happy to see that data is being reused. We're hoping that this advances science. The second part of FAIR is Accessible. It's one thing just to throw an Excel file up on an FTP server, but with Dataverse what we're trying to do is give researchers tools to explain exactly what their data is about. So we support what we would say is a rich set of metadata fields, and Dataverse is customizable to the scientific discipline.
So for example, there's a group at Harvard Medical School that has structural biology data, so they create their own metadata fields that matter to them. That's for the humans to read on the one side, but then we also support lots and lots of standards for interoperating with other data repositories: XML and JSON, in a variety of formats. Google Dataset Search, for example, uses a standard called schema.org, specifically JSON-LD with the Dataset part of schema.org. And then older standards like Dublin Core are in XML. There's a whole variety of formats like that to make data accessible. For Interoperable, the third letter in FAIR, I wanted to mention that Dataverse is not trying to be all things to all people. We're trying to focus really on the research data, but we're very happy to interoperate and integrate with other platforms. So if a researcher is happy to use Dropbox for the early work in their study, that's totally fine. They can just get it into Dataverse later. Or there are other complementary tools like Open Science Framework, RSpace, which is like an electronic lab notebook, and Open Journal Systems. And then once the data has been published, or even before publication, I would say, we're happy to integrate with computational environments. So Jupyter notebooks, for example, can be opened up in Binder. You just punch in the DOI of the data set from Dataverse. And there's a group called Whole Tale that is all about reproducibility. You may have heard that in science there is what we call the reproducibility crisis. I'm not saying we're going to solve that problem, but we're trying to make an effort toward that. Reusable. And back toward that reproducibility problem, one thing we're seeing is that journals are increasingly requiring the publication of data in order for the paper to be published. This is a very positive thing.
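As an aside on the schema.org JSON-LD format mentioned above for Google Dataset Search, here is a minimal sketch in Python of what such a record might look like. It is not Dataverse's actual export: the helper name `dataset_jsonld`, the title, the author, and the DOI are all made up for illustration, and only a handful of schema.org `Dataset` properties are shown.

```python
import json


def dataset_jsonld(title, authors, description, doi):
    """Build a minimal schema.org Dataset record as a JSON-LD dictionary.

    Hypothetical helper for illustration only; a real export would
    carry many more properties (license, dates, keywords, files).
    """
    doi_url = f"https://doi.org/{doi}"
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": doi_url,
        "identifier": doi_url,
        "name": title,
        "creator": [{"@type": "Person", "name": name} for name in authors],
        "description": description,
    }


# All values below are placeholders, not a real data set.
record = dataset_jsonld(
    title="Example climate measurements",
    authors=["Doe, Jane"],
    description="Primary data supporting an example manuscript.",
    doi="10.5072/FK2/EXAMPLE",
)

# A crawler would typically find this JSON embedded in the data set's web page.
print(json.dumps(record, indent=2))
```

The point of the sketch is that title, author, and description live in well-known fields, so a crawler can pick them up without knowing anything specific about the repository.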
It's a bit of a big stick to hit researchers with, to say we're sorry, you can't publish your paper until you publish the data set, but for scientific reproducibility it's a wonderful thing. Here's an example of this. The American Journal of Political Science has a replication policy that says you have to give us the data and also the code. And then there is a group of analysts at the Odum Institute at the University of North Carolina that will make sure the code executes and make sure that the plots in the paper can be reproduced. Then they give it the stamp of approval, and then the data set can be published, and then the paper can be published. So that's part of the story of reproducibility. The problem is that these poor analysts are downloading all kinds of software all the time to their laptops, trying to reproduce the work behind random data sets from all over the world. The next step for us is to partner with tools like Code Ocean, again Whole Tale, Renku, and Jupyter. These are reproducibility platforms. So instead of that analyst trying to reproduce the results on their laptop along with a lot of other data sets, what if they could click a button and have a Docker container spun up that has all the bits they need to reproduce that work? So again, DOIs for papers, DOIs for data sets, and maybe in the future DOIs for what we would call an execution environment. So a Dockerfile, a Docker image, that's sort of where our thinking is going in the future. These FAIR data principles are in an academic paper that you're welcome to check out. And I'd also point you to a recent talk by Mercè Crosas, who's been leading the Dataverse project for over 10 years. We had an event in Tromsø, Norway just a couple of weeks ago where there were 19 countries represented, and she gave a talk explaining this FAIR data concept from the Dataverse perspective.
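Going back to the Findable part of FAIR, where an aggregator uses the Dataverse Search API to pull in the latest records, a harvesting step could be sketched roughly as follows in Python. This is an illustration under assumptions, not how any particular aggregator is implemented: the helper names are made up, the sample response is hand-written and trimmed to the fields used here, and no network request is made.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url, query, per_page=10):
    """Build a Search API URL asking for data sets, newest first."""
    params = {
        "q": query,
        "type": "dataset",
        "sort": "date",
        "order": "desc",
        "per_page": per_page,
    }
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


def extract_records(response_text):
    """Pull (title, persistent identifier) pairs out of a search response."""
    payload = json.loads(response_text)
    items = payload.get("data", {}).get("items", [])
    return [(item.get("name"), item.get("global_id")) for item in items]


# Hand-written stand-in for a response body; identifiers are placeholders.
SAMPLE_RESPONSE = json.dumps({
    "status": "OK",
    "data": {
        "total_count": 2,
        "items": [
            {"name": "Example data set A", "type": "dataset",
             "global_id": "doi:10.5072/FK2/AAAAAA"},
            {"name": "Example data set B", "type": "dataset",
             "global_id": "doi:10.5072/FK2/BBBBBB"},
        ],
    },
})

url = build_search_url("https://demo.dataverse.org", "climate", per_page=2)
records = extract_records(SAMPLE_RESPONSE)
print(url)
for title, pid in records:
    print(f"{title}: {pid}")
```

To run this against a live installation, fetch `url` with any HTTP client and pass the response body to `extract_records`.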
And I'd like to note that when I landed here on Friday, I was invited by Yusuf and others from the State Archives of Belgium, and we had a nice little meeting of representatives from six countries all running Dataverse. So thanks again for that. It was great to see friendly faces upon arriving in Brussels. I have a little bit of bonus content, with two minutes left. This is something I believe strongly in: in the past, open source has been very open in its communication. Whether we're talking about the announcement of the GNU project, or the announcement of Linux, or discussion about open source and free software throughout time, we can still go back and look at that communication today. But what I see more and more is that lots of projects are using Slack, which is fine. We use Slack to say things like, hey, I brought in donuts, come on by. You know, it's great for that. But when you're thinking about your communities and you're making decisions about your projects and the direction you're going, I'd just like to encourage everyone to continue to hold to our tradition of openness. And so if there can be an acronym called FAIR about data, I thought I could make an acronym called SLOPI about communication. SLOPI stands for Searchable, Linkable, Open, Public, Indexed. I wrote a little blog post with more about what SLOPI is. So that's that. The last thing is that there's a group called CHAOSS that's around. There was CHAOSScon on Friday. And there's a project at Harvard called the Open Source Software Health Index Project. The idea here is that something developers like us naturally do all the time is compare two projects and say, well, which one is healthier? Which is the horse to bet on? What we're trying to do is get towards a way to quantify some of this. CHAOSS has built this awesome tool called Augur that will collect data about projects from GitHub repos, and we're starting to mine that data a little bit.
And I just want to put this project on your radar. With that, I just wanted to say thank you. I don't think we have time for questions, unfortunately, but please find me online. We have a chat room, chat.dataverse.org. Here's my email and my Twitter. And thank you very much for your attention. Thank you.