Hello, everybody. I'm Mercè Crosas from the Institute for Quantitative Social Science at Harvard University. We're here to talk about Cloud Dataverse, in a joint talk with my collaborators Orran and Piyanai from the Massachusetts Open Cloud. If we had to send just one message in this talk, it's that data repositories need clouds, and clouds need data repositories. That's why we built Cloud Dataverse, and we'll go into the details. First, where is the need? Why are we at a point where data repositories need the cloud, and the cloud needs the data? Then we'll introduce our platforms: Dataverse, an open source platform for building data repositories, and the Massachusetts Open Cloud, built on top of OpenStack. And then we'll show you the solution when we bring the MOC, the Massachusetts Open Cloud, and Dataverse together.

First, we're not the only ones who have seen the value in data. AWS, for the last few years, has realized that data is important and that it needs to be close to computing. In their AWS Public Datasets initiative, they say that when data is made publicly available on AWS, anyone can analyze any volume of it without needing to download or store it themselves. So they bring in data from different fields. One of the well-known data sets on AWS is from the 1000 Genomes Project. Having the data close to computing lets you analyze those data sets from the 1000 Genomes volunteers and, from there, find variants and mutations within a population or across populations. That's new discovery for science. But the same applies to data from many other fields and from industry: having access to data close to computing lets you discover new insights. AWS, however, doesn't provide all the features you would need for a data repository in today's world.
One of the first things you need is the right incentives for data sharing. If you are a data author who has collected data, prepared the data set so that it can be analyzed, and worked on cleaning and processing it, you have spent a lot of time on that data set. You might be okay making it available to others, so that you're not the only one discovering insights from that data, but on the other hand, you want to get credit for it. We found that formal data citations, a way to provide a reference to a data set that gives you attribution, are one way of building an incentive for sharing your data. These citations or references need to be done in a way that lets you cite every version of the data set. A data set is, in a way, a living object: it's not just a publication like a book or a paper; it will keep having new versions, and you want to be able to reference the version you used for your discovery. Data sets also need to provide enough metadata so that they can be discovered easily, through discovery indices or other systems that search for data sets. And, very importantly, the data repository needs to support tiered access to the data. Unlike AWS public data sets, where everything is completely public and open, in cases where you need a certain license, a data use agreement, or some other restriction to access the data, you can provide that through your repository. Finally, the repository needs a commitment to archive the data, so that if you are referencing your data from some other source, you can guarantee that the data set will still be accessible over time. It turns out that the scientific community has been working with data repositories for a long time.
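As a sketch of what such a versioned citation looks like in practice: the layout below follows the general shape of Dataverse-style citations (authors, year, title, repository, DOI, version number), but the exact fields, punctuation, and the example values are illustrative assumptions, not the exact Dataverse implementation.

```python
# Illustrative sketch of assembling a versioned data citation.
# Field layout and example values are assumptions, not Dataverse's exact code.

def format_data_citation(authors, year, title, repository, doi, version):
    """Build a citation string that pins a specific version of a data set."""
    return (f"{'; '.join(authors)}, {year}, \"{title}\", "
            f"{repository}, https://doi.org/{doi}, V{version}")

print(format_data_citation(
    authors=["Doe, Jane"],          # hypothetical data author
    year=2017,
    title="Example Survey Data",    # hypothetical data set
    repository="Harvard Dataverse",
    doi="10.7910/DVN/EXAMPLE",      # hypothetical DOI
    version=2,
))
```

The version suffix is the key point: citing "V2" means a later "V3" of the living data set cannot silently change what your bibliography refers to.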
Across different research fields and across different continents, people have been building data archives and repositories that provide this access to data, starting in the late 1950s and running all the way to the present, across social science, life sciences, earth sciences, and astronomy. But today's data repositories provide additional features beyond those last-century data archives. They have also figured out, as I said before, that incentives for data sharing are important: they provide a platform that allows you, as a data generator or data author, to upload the data and make it available to others while you keep control over that data set. This started with our project, the Dataverse Project, in 2006, with a system that provides the software for building this type of modern-day data repository. There are other examples, like the Dryad data repository, figshare, and others. And if you look at the re3data website, which indexes research data repositories, you can see that in the last decade the number of data repositories has increased enormously. Coming back to the point about building incentives and data citations, what this means is that a bibliography would contain not only references to other publications, but also references to data sets, with the data repository where each data set can be found, and with attribution to the data authors.
So the Dataverse open source platform has, over this last decade, helped many different organizations build data repositories of many types. Some of these repositories support multiple universities; for example, one hosted at the University of Texas supports 20 different universities in that area. Some are set up in China, because that way the repository has more control over the data, and for legal reasons they need to keep the data locally in China. And there are data repositories specific to a domain, for example an agriculture data repository. But what are the challenges with a platform like the Dataverse software, or other current repositories? They only support small data sets; you cannot deposit petabytes of data in them. And when you need to download the data or access it for computing, not everybody has the tools that would enable computing on those data sets. So what's missing is the computing and the support for large data sets. We came to the conclusion that Dataverse needed a cloud, and I'll pass it over to Orran now.

So I want to tell you about the Massachusetts Open Cloud, a regional project that was going on before this connection. This is a cloud that involves five of the largest universities in the region: Boston University, the whole UMass system, Northeastern, MIT, and Harvard. To give you a feeling for what that means, consider the whole Pacific coast of the US: the Pacific Research Platform is all the institutions that cover that entire area. If you compare that consortium to the MGHPCC consortium, which is what the MOC is based on, they're actually equivalent: about the same number of students.
They both have massive communities of scientists covering every field of research, collaborations across the globe, and massive data and computational requirements; we're talking about 175,000 students. But there's one little difference between the MGHPCC consortium and the Pacific Research Platform: imagine taking that whole thing and smooshing it into one building. That's what we've done here in Massachusetts. The MGHPCC data center is a shared data center we built: 15 megawatts, 90,000 square feet, about two acres of space. It already houses tens of thousands of HPC machines, and we still have tons of space to grow. So we have all that infrastructure in one building, and that gives us an incredible opportunity to build a cloud, which is what we've been doing with the MOC. It involves those five institutions, the Air Force, the Commonwealth of Massachusetts, and five major corporations; in fact, Chris Wright, in his keynote just a little while ago, discussed the shared relationship between Red Hat and this project, and there are a lot of other contributing partners.

A fundamental difference in what we're doing is that today's clouds are basically a black box: stood up by one provider that controls the whole thing, and you can't see inside it. We wanted to create a more open model of cloud, what we call an open cloud exchange, where different institutions or different companies can stand up infrastructure, charging under different models for that infrastructure. Different groups can stand up production cloud environments on top of this, and we can also stand up research environments. This enables research all the way down into the fundamental platform capabilities, with many different environments on top to solve problems in big data, web, or HPC. In fact, our funding from the Commonwealth was to help enable the big data ecosystem of the Commonwealth.
So we started off with OpenStack, which is great, but where's the data? We needed a fundamental capability: to share data within this new open cloud exchange model, between different providers of infrastructure. We also needed to expose all that cloud metadata up to researchers who might want to analyze it, and to users; that source of information seemed dramatically important, and you can't see operational information about any existing cloud today. Our scientific users needed many of the same things Mercè was talking about. They wanted to be able to deal with petabytes, or at least many terabytes, of data, and downloading that over the internet was kind of horrific to them. We wanted to have public data sets like Amazon has, but we also wanted community data sets: data sets shared within astronomy, or within a group in biology. Scientists spend enormous effort creating those data sets; they didn't just want to make them public, they wanted to control who could access them and know that they were getting citations for all the work they'd done creating them. And they wanted to be able to exploit rich tools: to rapidly stand up a Hadoop environment, or a different kind of analytics environment, on top, to compute on a data set. We also had a rich set of industry partners and public sector institutions involved that wanted to put their data sets up there, to share their data with all kinds of researchers and startups; again, part of the mission was to enable startups in the Commonwealth. But we hit some really interesting things: they did not want to simply make their data sets public. For example, we're working with the MBTA.
They have this massive data set of where every bus has been, every second, for the last 10 years. Cool information; you could do all kinds of interesting things on top of that. But it also contains, inside, the name of the bus driver. These public institutions would love to have all kinds of people doing analytics on top of their data, but they're not going to go through the work of anonymizing these data sets. They want to put the data out there, allow it to be discoverable, and then make an agreement to actually make the data set available to a requester. So they want to be very open, but if you add the barrier that the data has to be public, you're never going to get it out there in the cloud. And we wanted a mall where all these data sets exist, so that all kinds of companies could expose their value in an environment alongside all these data sets. So we realized fairly rapidly, and we were excited when Mercè met with us and we started talking about the synergy between these, that the MOC needed a modern data repository system. But more fundamentally, we think this is a fundamental requirement for OpenStack. It's nothing esoteric; we have an immediate demand for it. We think data is going to drive compute in the future, and there's nothing like this in OpenStack today. So now Piyanai is going to talk about the work we did to integrate this together.

Good morning, and welcome to Boston for those of you from out of town. I hope you're having a good time; I always do at OpenStack Summit. So, let's see what we have. Mercè talked about what the data is all about: public data sets, incentives for the author, great. We built the MOC, the Massachusetts Open Cloud; we have a couple of clouds in our data center at MGHPCC. So, this is Dataverse before Cloud Dataverse. What do we have? We have OpenStack here, and we have the internet.
You put the repository in one place, not co-located with your compute platform, which in this case is OpenStack, and you have to deal with the internet bandwidth in between. That's not fun; we know that. This is Dataverse after Cloud Dataverse, specifically with OpenStack. What OpenStack brings to a repository like Dataverse is compute: you have Nova, you have Swift object storage, and you also have a UI, because you need a way for users to get access to it. Awesome. So, what's the problem? What's missing from the OpenStack cloud? We need a bridge between the repository and the compute platform. So we built what we call Gigi. It's a simple UI, very simple, and by the way, there's a talk about Gigi later on, for anyone interested. Gigi is a bridge into OpenStack: it talks to the different services in OpenStack, and it also serves as a gateway into a repository like Dataverse. Now, what's missing in Dataverse? Somehow you need to allow the user to get to the Swift endpoint. So what we did in Dataverse is change the code so that instead of uploading to the file system, you upload to Swift objects directly. And then we created a compute button in Dataverse. Those are very simple changes, and voilà, we have Cloud Dataverse. To summarize again: we made sure that Dataverse can upload to Swift, we added the compute button, and now we have Cloud Dataverse.

Next, we'd like to demonstrate. We have a very neat demo, I thought so anyway, built around the Billion Object Platform (BOP) at IQSS at Harvard. I'll let the demo speak for itself, but it demonstrates how you can take tweets, put them in Dataverse, and upload them to Swift. We walk from the Swift endpoint to Gigi, go back and forth between Gigi and Horizon, and then bring up Sahara to build a cluster.
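The upload-to-Swift change can be sketched roughly as below. This is a minimal illustration, not the actual Cloud Dataverse code: the container-naming rule, the endpoint, and the credentials are all assumptions, and the Swift calls use the standard python-swiftclient API.

```python
# Sketch of the Dataverse-side change: write an uploaded data file into a
# Swift container named after the data set, instead of the local file system.
# Naming rule, endpoint, and credentials below are illustrative assumptions.

def container_name_for(doi):
    """Derive a Swift-friendly container name from a data set DOI
    (an illustrative rule, not the one Dataverse actually uses)."""
    return doi.replace("doi:", "").replace("/", "_").replace(".", "-")

def upload_datafile(doi, filename, data, authurl, user, key):
    # Imported lazily so the pure naming helper works without the package.
    from swiftclient.client import Connection  # pip install python-swiftclient
    conn = Connection(authurl=authurl, user=user, key=key, auth_version="3")
    container = container_name_for(doi)
    conn.put_container(container)                        # create if absent
    conn.put_object(container, filename, contents=data)  # store the file
    return container

# Usage (placeholder endpoint and credentials):
# upload_datafile("doi:10.7910/DVN/EXAMPLE", "tweets.csv", b"...",
#                 "https://keystone.example.org:5000/v3", "demo", "secret")
```

Once the file lives in a container, anything that can reach the Swift endpoint, inside or outside OpenStack, can compute on it; that's what makes the compute button possible.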
And finally, you get to the Spark cluster from Gigi and you get the report. We ran it beforehand, so we know it works. Okay, let's start this from the beginning. This demo starts off with the BOP platform and its geo-tweets. You'll see on the screen that you can specify the date and time frame you want to take tweets from, as well as a geospatial bounding box in lat/long. Say we want to search for a particular keyword: we ask BOP to search for it, and it comes up with about a million tweets. You can also zoom in on a certain area with BOP; once you zoom in, you'll see the number change to the number of tweets that actually happened there. From there, each time you click on a tweet, you can see where it was mentioned. Now, in BOP there is a button: you click the Dataverse button and it brings you to Cloud Dataverse, and you log in; you have to log in to get access to the data. Once you're in, you'll see the BOP geo-tweet data set. You click on it, and you'll notice we now have a container: because you clicked the Dataverse button, the data set got uploaded to the Swift endpoint under a container name. You can see the container name there, so you can do anything you want from then on. You'll also see a compute button; that's our gateway into OpenStack. If you click on the data file itself, there's also a compute button there. And this is a direct URL: once the data is in Swift, you can take it and do any compute you want. You don't actually need OpenStack; if you just want something simple, like what's demonstrated here, you can get to the data set itself.
Now, if we click on the compute button, we get into OpenStack. What we show here is Gigi, the simple UI I mentioned. You come in here and basically talk to Sahara: we build and launch a cluster, a small Spark cluster, and give it a cluster name. Anyone who knows Sahara knows there are many, many forms you have to fill out before you get to this point; what we've done is provide a couple of defaults for the user. You see we launch a cluster, and you see it's pending. Then we create the job we'd like to run: a simple word count on Spark. For the input, we put a star there to capture anything in the container, and then we specify the output. Click on launch. Now you have your Sahara Spark cluster, and you run your job. There's a button on the Gigi side that takes you to Horizon, and you'll see the job running there. You can also go to your Spark cluster and see the job running there. All of this is in actual time; we didn't cut out any time at all. Right now the job is running; it takes a little while, about 13 seconds, so we'll wait, because we want you to see the effect of a real job running. Then you go to the container the job wrote to, and you see the result: it says successful, and here is the output.

Okay, so that's the demo. You saw the whole journey, so let me summarize it. We start with a scientist, Lucy, who wants to upload her data and compute on it. She starts by going to the BOP platform and exploring the tweets she's interested in. Then she clicks the brown Dataverse button. Once she's done that, the data gets uploaded to Swift, and BOP brings her to Dataverse.
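The demo's Spark job is a plain word count. Its flatMap-map-reduceByKey pipeline can be sketched in local Python as below; the PySpark equivalent is shown in the comment. This is an illustration of the kind of job run in the demo, not the demo's actual code.

```python
# Local sketch of the word-count logic the demo's Spark job performs.
from collections import Counter
from itertools import chain

def word_count(lines):
    # flatMap: split every line into words, then flatten into one stream
    words = chain.from_iterable(line.split() for line in lines)
    # map + reduceByKey: count occurrences of each word
    return Counter(words)

counts = word_count(["to be or not to be", "be quick"])
print(counts["be"])  # -> 3

# Roughly the same pipeline on a Spark RDD:
#   sc.textFile(input_path) \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)
```

With the input set to `*`, the job would run this over everything in the data set's Swift container and write the per-word counts to the output container shown at the end of the demo.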
You see the green button there; that's the compute button we went through in the demo. Once she clicks that, she gets into Gigi. In the meantime, with the data in Swift, you can do any compute you like. Once you're in Gigi, you can go back and forth, go to Sahara, bring up your Spark cluster, and Spark will produce a report for you. And at any time, you can go from Gigi to Horizon.

So how long did all this take us? We started back last summer with a POC, just to see if it would work at all, and it turned out it works fine. It's a very simple concept, nothing complicated, but it works. Then in the fall we went to the Dataverse community; don't forget that Dataverse is just another open source project. We went there and asked if this would be useful to them, and they said yes. We also showcased this at the Barcelona Summit and got very good comments out of that. Then in December we had our MOC annual workshop, where we demonstrated this to our cloud users, and they said this is a great feature, we'd like to see it. So we started the real work this January, with full collaboration between the two teams. And now, at this Summit, we have the full Swift feature in Dataverse, up in the Dataverse repository on GitHub; it got merged, I think, last Thursday. We also have Gigi, which the MOC built, and the demo we showed you. Over the summer, what we would love to do is worldwide data federation, which Mercè is going to talk about next.

Just so you know, there are a few other talks related to this one, with demos; you can find the information in the schedule. So, hopefully we've convinced you of the value of bringing a data repository platform into an OpenStack cloud.
That way, it becomes easy to bring data into the cloud, in a more reliable way, with more features to support the data sets. But it can get even better. Our goal is not only a solution where one cloud has a repository, but a network of Dataverse repositories around the world, connected so that they can all easily upload their data to one Cloud Dataverse. That way you have a worldwide federation of data sets that can be computed on in the cloud, and each repository can enable that in a way that makes it very easy to put data, either publicly or with restrictions, in a place close to computation. That's one of the next projects as part of Cloud Dataverse. So, to end this talk: we started with "data repositories need clouds, clouds need data repositories," and we showed you that with Cloud Dataverse we combine the power and scalability of an OpenStack cloud with the need to access data through a feature-rich repository. And we haven't done this alone; we've done it with the Dataverse team of developers and the MOC team. So we thank our teams, and thank you for listening. We have time for a few questions; if people need to get to lunch, we understand. Any questions? If you can go up to the mic, that would be great, so it's recorded.

So, a question: since Dataverse needs to work with Gigi to talk to Horizon and Sahara, why isn't Gigi part of Dataverse? I think, first of all, it's not necessary; you can take the URI and copy it manually into the other tools. But really, what we want is that, just like OpenStack, there is a series of different projects. You'll see a whole talk on Gigi this afternoon, but it's actually not intended to supplant Horizon.
People are going to want to do complex things in Horizon at some point. Gigi, though, is meant for our users to have a fast experience doing the most common thing. To add on to that: Dataverse is not a compute platform; Dataverse is a data repository. And besides OpenStack, I believe there's a feature coming up to integrate with other clouds, for example Azure from Microsoft, or even AWS; we don't know. So we don't want to mix the two sets of features, and that's why we built it as a separate thing. I see, okay, thank you. Any other questions?

All right, so what we'd really love is to get this in as, essentially, another project in OpenStack, although, like Keystone, it's broader than that. So please, we'd love people to participate and join in on this effort. Yes, and come talk to us with ideas and suggestions on how to make that project happen and how to integrate it with other pieces of OpenStack if needed, and we'll be happy to contribute to the project together. We think this is a fundamentally missing part: when we saw this from the MOC side, we felt like a Dataverse-like thing was a fundamentally missing part of a cloud platform, and I'm kind of shocked that nobody has been doing this until now. There was a question back there?

No, the two really aren't related. This isn't about the open cloud exchange part, although you can tie this into where you store the data. Right, and one other thing I didn't mention: when you talk about OpenStack services, there are actually two pieces. You have services that the OpenStack cloud brings, but you also have resources you can provide to the Dataverse user, and those resources include petabytes of storage space.
One issue in OpenStack today is that if you're in one cloud, you might not have the storage there, but there's another OpenStack cloud that does. We have a project called Mix-and-Match Federation; the GitHub repository is up on OpenStack right now. With that project, we're hoping someday to go to the Big Tent; we don't know when yet. But with it, if you're in one cloud, you can go out and get storage from another cloud with your own identity. You don't have to have an identity in the other cloud, and you don't necessarily have to log into the other cloud to do that. This came from feedback at last year's Summit about Mix-and-Match Federation; in fact, Kristi, who works on it, is here, if you're interested in talking to him about it. But these aren't directly related; this is a way of storing your data, and eventually you can control where you store it.

Is this working? Yeah. So, a question: Dataverse is one of many places where researchers could put their data. There are obviously lots of domain- or discipline-specific data repositories as well; EBI is one of them. I see that this is a great opportunity, but the diversity of platforms for storing published data is huge. We have the same problem: getting research data in and out of our cloud environment, across this diversity of different solutions distributed globally, is a big issue and a big challenge. So I guess I'm questioning and wondering what an effective approach would be. I think it's a very good problem to think about and solve, but I'm wondering what we can do to uplift everyone. Is a focus on Dataverse appropriate? Is a framework enabling multiple repositories' entry into OpenStack more appropriate? I'm just thinking about what sort of model would work best to uplift this data-driven compute. You're saying that compute is going to be data-driven; I agree wholeheartedly. So how do we uplift everybody?
Right, and it's a good question, and in some ways a good problem to have, because we do want to work with other repository platforms. In the end, the same way Piyanai was saying this could be a solution for other types of clouds, not only OpenStack clouds, it can also be a solution for other types of repositories. I do want to clarify, though, that while there are a very large number of data repositories, many of them domain-specific, there are far fewer platforms that are open source, with a community of contributors, for building repository software; maybe a handful are actually in use. Dataverse is one of the more widely used. We have the Harvard Dataverse, for example, where any researcher can deposit data, but we don't provide only one repository; we provide the platform to build any type, so EBI, say, with genomics data, could build their repository on top of Dataverse. So we thought the choice of Dataverse with OpenStack gives two open source solutions, each with a community, that together can support the widest range of repositories and clouds. That's the main reason.

And as an external party, Piyanai and I aren't part of the Dataverse staff, so we were evaluating different alternatives, and it was actually what Mercè said: first of all, this was an open source platform; B, it could support the customization for multiple different domain-specific types of data; and C, data citations. There were a few systems that did some of this, but what differentiated this one was that it was a platform rather than something specific to one domain. If we wanted to create a cloud that supported multiple different domains, there weren't actually that many options around, and this one has a lot of traction.
So it's like anything in OpenStack: you end up having a reference implementation. For us, we wanted a full solution, and this is a very reasonable reference implementation, but it should support alternatives that different people can plug into it. Exactly, and that's where we think we're being pioneers: bringing data sharing and data repositories close to cloud platforms, with this as one implementation. Do you want to go to the mic, Peter? Sorry, they're recording. Peter is from Madison University.

Yeah, so just adding on to that, and this is veering a bit from the OpenStack side toward the scientific researcher side: in computer science there are a bunch of different ad hoc repositories of things, but what we've certainly seen in Dataverse is that it has support for citation and a whole bunch of other features; I mean, there have been scientific librarians involved in this whole thing. So it has a lot of features on the archival and science side that are important, and I'm sure there are counterparts in fields that, as a computer scientist, I'm not aware of, but certainly in CS there aren't. The other thing that really resonated for us is that our industry partners wanted to publish data sets, make them discoverable immediately to everybody, and then make them available so that people could come and request access to a data set, rather than just making them public data sets. We'd also love to hear of use cases, so after this, people could describe them to us, because our feeling is that this was designed for scientists, but it seems very, very general purpose to us, and of value to a much broader community. So I'd love to hear of other use cases and features we need to build. Yeah, that would be great.
And if you want to learn more about Dataverse: on June 14th, 15th, and 16th, the Dataverse Community Meeting is here at Harvard University, and you're welcome to join us. And there will again be several talks related to this one this afternoon and on Wednesday, right? Yep. Thanks. Thanks, guys. Thank you. Thank you.