Hi, everyone. Welcome to our talk, "When Dataverse Meets OpenStack: Cloud Dataverse." So, quick show of hands: how many of you attended the Cloud Dataverse talk on Monday, given by Mercè and our colleagues from the MOC? Okay, it's fine if you didn't. This talk will review some of that and then go into more detail. But had you been there, the last slide they ended on was this one, which says: data repositories need clouds, and clouds need data repositories. With the Cloud Dataverse project, we combine the power and scalability of the OpenStack cloud with the need to access data through a feature-rich repository. Their talk focused more on the strategy and the need; ours will go into more detail about what Dataverse is and the technical implementation of what we've done.

I'm here with my colleague Leonid, and with Jeremy from the Mass Open Cloud. I'm Gustavo, the tech lead on the Dataverse project. I'll start with an intro to Dataverse, its features, and the technology we use. Then Leonid will talk about what we did to add OpenStack support to Dataverse. And then Jeremy will give a demo of Cloud Dataverse in practice and discuss its architecture. We'll end with some future considerations and questions, if we have time.

Okay. So, Dataverse is an open-source platform to share and archive data. It's developed by us at IQSS at Harvard, and we've been working on it since about 2006. In the ten years or so we've worked on it, we've realized that there are lots of different people who want repositories for their data, with lots of different needs and use cases. So when we re-architected the project about four years ago, one of the things we made sure to do was to build in support for lots of different types of users, lots of different types of data, and the different workflows they all need. We've developed it with funding from IQSS, but we also get grant funding, and we work in collaboration with a lot of different institutions, including, in this case, the Mass Open Cloud at BU. Our core development team is 13 people: developers, designers, UI/UX people, a metadata specialist, a curation manager, a project manager, and me as tech lead.

It's an open-source project, so while we develop the software ourselves and run one of the installations, there are 22 others (actually, this map is a little outdated; there are now 23, since we got a new installation in Indonesia last week) running versions of the software. Besides the community that uses the software, we also get a lot of code contributions from outside our core team, about 38 code contributors so far, and we have hundreds of community members, developers, researchers, librarians, and data scientists, with whom we interact through several communication channels on a daily and weekly basis. There's an active Dataverse Google Group mailing list, and it's nice because when we first started the list we were usually the ones to respond, but as the community has grown, other members often answer before we even get to it and help each other out, and the community itself keeps growing.
We hold biweekly Dataverse community calls, where anyone interested in the topic we're going to discuss can call in, and we have an open forum for questions there as well. We also hold an annual Dataverse Community Meeting; if any of you are interested, it's next month, June 14 through 16, across the river in Cambridge. So if you're interested, let me know afterwards and I can get your information.

This slide is just a link to our roadmap. When we have more time, we click into the roadmap and look at the details of what we're working on for the future, but I wanted to include it here so that if you later look at the slides in more detail, you can click through and explore it yourself. I'm going to focus on what's there now rather than the roadmap, because that leaves more time for these guys to get into the more interesting technical details.

Okay, so here are some of the features that Dataverse provides, and a lot of this goes back to that earlier point that it's built around multiple kinds of users, data, and workflows. So a lot of the features are built to be dynamic, able to support the different needs our users have. We have multiple ways of signing in: a native authentication system, Shibboleth if you want to sign in with your institution, and OAuth support for things like GitHub, ORCID, and Google. We provide different ways to do branding, so each installation can brand itself as its own; even though they're running the same software, each will look different and can include its own logos, information, and links to its project sites. We'll talk a little more about Dataverses within Dataverses in the next slide; I'll explain what a Dataverse container is and how that works, and I have a couple of nice diagrams a colleague drew up.

One thing that's important in the data publishing world is citations, so our software automatically generates citations for your data, and as you make changes and create new versions, the citation version number changes so that people know which version of the data they are citing and referencing. We also support domain-specific metadata. This was a key feature: even though we created it coming out of a social science institute, we wanted to make sure the software could be used by researchers, scientists, and users across domains. So we built a very dynamic, flexible system for creating what we call metadata blocks, for different kinds of metadata. By default, when you install Dataverse, you have the citation metadata block, which is required because it's used to build the automatically generated citations, but you can also enable a social science block or an astronomy block; we've worked with a biomedical group to create a biomed block, and the software is built so you can easily upload and add new blocks as we work with experts in their domains to figure out the standards they use for their metadata. Any individual installation of Dataverse can support multiple domains, whichever domains it's most interested in.
At Harvard Dataverse, for example, we are open to data from everywhere, so we have all the blocks enabled. As I mentioned with the citation, we have versioning, with minor and major versions of your data: if you add new files, for example, you create a new major version, but if all you did was fix a typo, maybe it's just a minor version, 1.1, 1.2, and so on. We have a lot of different publishing workflows, and this was important because we have use cases where individual researchers want their own Dataverses, with full control, and publish their own datasets; but we also have a different use case of journals that want to maintain publishing control. They allow people to contribute and upload datasets, but the final authority on publishing rests with the journal itself, and the people who upload the data cannot publish it themselves. There's a workflow mechanism for that: when a dataset is ready for publishing, it goes to a curator, who reviews it and can either publish it or send it back to the author.

One thing that was also very important to us in the re-architecture was to provide robust APIs. We have the web application, and I'll talk a little more about the technology there in a minute, but we also created APIs so that anything you can do via the web app you can do via the APIs: you can build your own UI, you can automate things (a minimal example of a call is sketched below). We have a lot of partnerships with groups that use the API for ingesting new data into the system, or for downloading and searching and things like that.

The last feature I'll mention here is harvesting, which is the idea of getting metadata from other Dataverse installations and other institutions. We use OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, which is a standard for exactly this kind of thing: as updates are made to the data, it tells the client that updates are there. So at Harvard Dataverse, for example, we harvest from all the other Dataverse installations that we can, and also from non-Dataverse installations that use OAI-PMH. When you search at Harvard Dataverse, you can search metadata not just from us but from everywhere we harvest from, and other installations do the same thing. One of our partners, Odum, also harvests from everyone, and it just makes the data much more discoverable, because you don't have to go to the specific place where it's hosted to find it.

Okay, so this is the schematic I was talking about. You can think of a Dataverse basically as a container, and one of the things we realized is that to model the different types of institutions and researchers, we needed to be flexible and allow containers within containers. A Dataverse was originally created to hold datasets, but now we've modified it so it can also hold Dataverses: a Dataverse can contain multiple Dataverses, each of which can contain other Dataverses, and all of those levels can contain datasets. The idea is that if you're an individual researcher, maybe you have your Dataverse with your five, ten, fifteen datasets in it, but if you're an institution, maybe you create one high-level Dataverse and then below it create individual Dataverses for the researchers, and things like that.
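Picking up the API point from a moment ago, here is a minimal sketch of what a call against a Dataverse installation's native Search API can look like from Java. The host, the query term, and the use of plain HttpURLConnection are illustrative assumptions; the actual endpoints and options are described in the project's API documentation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Minimal sketch: query a Dataverse installation's Search API over HTTP.
// The host and query below are placeholders for illustration only.
public class DataverseSearchExample {
    public static void main(String[] args) throws Exception {
        String query = URLEncoder.encode("census", "UTF-8");
        URL url = new URL("https://demo.dataverse.org/api/search?q=" + query + "&type=dataset");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Public metadata is searchable anonymously; write operations take an
        // API token, typically sent in an X-Dataverse-key header.

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON listing the matching datasets
            }
        }
    }
}
```

The same style of call works for the other API uses the speakers mention, such as ingesting new data or downloading files, which is how partner groups script their workflows against the repository.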
Again, using our installation, because that's the one I know best: at Harvard we have a high-level Harvard Dataverse, and within it there are different Dataverses for all the different institutions that want to put their data on Harvard, and within those Dataverses they might take more control and create some sub-Dataverses, or just upload their datasets directly. But the core, the guts, I would say, is the dataset itself. That's what contains the collection of data files and supporting files that you want to make available to the community, and it also contains the metadata describing that dataset; that's where the dynamic metadata we talked about comes in, to richly describe the data you have.

Okay, so this is our technical stack. Being open source, we use open-source products. We use GlassFish Server 4.1; it's a web application, and using GlassFish means we're able to use the latest Java SE 8 (Java SE 9 is coming out soon, and hopefully we'll upgrade to it sometime within the next year). So right now we're on Standard Edition 8, and then we heavily leverage the Enterprise Edition, which is what the EE stands for in Java EE 7. Java EE 7 is composed of lots of different modules, and we use pieces of it at the presentation layer, the business layer, and the storage layer. For presentation we use JavaServer Faces and the RESTful API support; for business logic we use EJBs a lot, which give us transactions, asynchronous methods (for things like ingesting large files, which can take longer, so we can return feedback to the user sooner), and timers for things we want to run overnight or other longer-running processes. The backend we store with JPA, the Java Persistence API, and we use bean validation to make sure that the data that gets into the database is valid. But what's most relevant to this talk is the storage. Our data and metadata are all stored in a Postgres database, we use Solr for indexing and search, and then the files themselves either live on the file system or, with what we've added in this collaboration, they can now live in a Swift object store. So now I'm going to pass it along to Leonid, and he's going to tell you more about what we've done to add this support.

Hi, I'm Leonid. I'm going to be talking about what we had to do to add OpenStack support to our application. One part of it is direct access to cloud computing. The goal of this relationship between Dataverse and the MOC was not to merge our applications or to duplicate functionality in both places; it was to bring together the expertise we already have on each side. So, in terms of direct access to cloud computing: we have massive amounts of expertise in publishing, in accumulating metadata, in letting people make their research discoverable by other institutions. But now some of our researchers want to allow their users to run computations on their data. We don't have any infrastructure for that, and we didn't want to get into the business of implementing it. That's where the MOC comes in: they already have that infrastructure, and this "direct access to cloud computing" is really a fancy way of saying that we are sending our users to their side. Some interactions between our software platforms need to happen for that. We need to provide some metadata describing just what kinds of data files we have in a given dataset.
They, in turn, need to provide a certain amount of metadata explaining what kinds of facilities they have. But once that exchange happens, we basically just tell our user: click this button, and off you go to the MOC. I'm not going to talk about this part much, because Jeremy is going to describe it in detail and show it in his demo. I'll be talking about the second part, the storage driver, because that's the part I was working on.

So here is a very simple schematic diagram of what happens as we store our files. In normal operations, our users upload files, and they get stored on some local file system, some giant file system appliance. Of course, there is a level of abstraction that separates the actual physical storage hardware from the application, and there are drivers that implement that storage. A certain amount of metadata about each file is stored in the database, which tells the data file reader and writer layer which driver needs to be used to access a given file. We provide our normal, standard, default file system storage driver, and there is a read-only storage driver for accessing files on a remote HTTP server. So basically we had to add another driver that provides read-write access to Swift storage. There wasn't much code, really, and implementing it was not exactly rocket science. And that's the main message of this talk: we were able to add access to a very significant amount of functionality for our users with relatively little code, and very straightforward code at that.

The way it's done: DataFileIO is a very standard abstract interface that the individual drivers implement, so it's your standard Java inheritance. We define the methods you need to read and write the files: you can get a Java NIO channel, you can get input or output streams. As an example, here's a fairly useful call, savePath. In our usual workflow scenario, a user uploads a file, it gets dumped onto a temporary file system, and when you're ready to save it permanently, you call this and it ends up in its final location. This is the implementation for the file system: it really ends up being one line of Java NIO code, translating one to one to the Files.copy method. And for Swift, we ended up implementing the same thing with just one line of Java code, the uploadObject call. So it's similarly straightforward pretty much throughout the added Swift storage implementation.

Just to illustrate: to our users, going to our interface and looking at files, it's totally transparent. One of these files happens to be stored on the local file system, the other one on Swift, and they both look the same to our users. The database that stores the metadata describing these files records the actual physical location of each file, and that's enough for the storage layer to know where to look. It just works. So there are two basic scenarios here. One is that we let our users go to the application and upload their files the way they normally would, and the files just magically end up stored on Swift.
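The slides with the actual driver code aren't reproduced in this transcript, so here is a minimal sketch of the pattern Leonid describes: one abstract storage-access class, with savePath() implemented once per backend. The class and method names, and the use of the JOSS Swift client (the library named later in the Q&A), are simplifications for illustration rather than the exact Dataverse code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.javaswift.joss.model.StoredObject;

// Abstract storage access: every driver knows how to move an uploaded
// temporary file to its permanent location on that backend.
abstract class DataFileIO {
    public abstract void savePath(Path temporaryFile) throws Exception;
}

// Local file system driver: boils down to a single java.nio call.
class FileAccessIO extends DataFileIO {
    private final Path permanentLocation;

    FileAccessIO(Path permanentLocation) {
        this.permanentLocation = permanentLocation;
    }

    @Override
    public void savePath(Path temporaryFile) throws Exception {
        Files.copy(temporaryFile, permanentLocation, StandardCopyOption.REPLACE_EXISTING);
    }
}

// Swift driver via the JOSS client: also essentially one call.
class SwiftAccessIO extends DataFileIO {
    private final StoredObject swiftObject; // obtained from a JOSS Container

    SwiftAccessIO(StoredObject swiftObject) {
        this.swiftObject = swiftObject;
    }

    @Override
    public void savePath(Path temporaryFile) throws Exception {
        swiftObject.uploadObject(temporaryFile.toFile());
    }
}
```

The metadata row stored for each file records which backend and location it lives on, so the reader and writer layer can pick the matching driver without the rest of the application, or the user, noticing any difference.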
The other scenario is that somebody may already have a massive amount of data on Swift, and we simply provide them with a way to import those files into Dataverse, making them discoverable and available through our normal functionality. And again, local files end up stored on the file system, and Swift files end up stored on Swift. Here I'm really just showing directory listings, which is not particularly exciting, but it shows, among other things, that for certain files we automatically generate additional files that we call derivative files. For an image file, for example, we'll automatically generate a few thumbnails in different sizes. So for a file stored in the cloud, these thumbnails were generated and stored there just the same, and it works transparently, without any interaction with the user. And on that note, I'm passing this on to Jeremy, who will talk about the MOC side of this collaboration. Thank you.

Thank you, Leonid. So now I'm going to be talking about Cloud Dataverse in practice, in particular the consequences and benefits of the changes we've made to Dataverse to turn it into Cloud Dataverse. To illustrate these points, I'll be referencing the pilot implementation of Cloud Dataverse at the Mass Open Cloud. This is the architecture we inherited, and you'll notice right away that it's not particularly helpful to users who actually want to analyze the data in Dataverse using compute. Whenever you see those blue arrows, they represent the transfer of datasets. In this case, with the existing Dataverse architecture, the datasets have to go over the internet first before you can analyze them using compute. That isn't very pleasant, there's not much incentive to do analysis, and really, it's just not convenient.

One intermediate step we can take to improve this is to swap out the traditional file system for an object store, which at least lets us store some larger objects. But what follows from that is that the compute platform can now access the datasets stored in Dataverse directly. So already the workflow is much simpler and more convenient, and we're starting to see some incentive to compute on the data. At this point, the architecture you see is probably suitable just for smaller datasets: you could imagine a user spinning up a single VM and using a language like R or a Python library like scikit-learn to analyze datasets that aren't too big, but at least it's convenient for them to get started.

Luckily, though, since we were using OpenStack, it was easy enough to integrate a big data analytics platform as well, through OpenStack Sahara. If you're not familiar with Sahara, it's a component of OpenStack that offers big data cluster provisioning, and it lets users run big data applications using Hadoop, Spark, Pig, Hive, and Storm. Hadoop, Spark, and Pig were of particular interest to us because they offer the tightest integration with the Swift object store. Sahara also offers a simple abstraction for job submission, and we didn't want to force our users to mess around with the command line just to do their analytics work. And as I mentioned, with the applications launched by Sahara, the Swift integration for job input and output works right out of the box.
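To give a feel for what "out of the box" means here, below is a rough sketch of a word-count job (the demo later runs a Scala version of the same idea) that reads its input straight from the Swift container behind a published dataset and writes its results to an output container. The container names and the ".sahara" service suffix follow the usual convention of Sahara's Swift integration but are assumptions for illustration; the matching Swift credentials would be supplied through the job configuration (the fs.swift.service.* settings) rather than hard-coded.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Sketch of a Spark word count whose input and output both live in Swift.
public class SwiftWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("swift-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // The wildcard mirrors the demo: analyze every file in the dataset's container.
            JavaRDD<String> lines = sc.textFile("swift://my-dataset-container.sahara/*");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // Results land in the output container the user named when submitting the job.
            counts.saveAsTextFile("swift://my-output-container.sahara/word-count-results");
        }
    }
}
```

Because both the input and the output are plain Swift URLs, the job itself never needs to know where Dataverse stored the files; the job submission just passes the container paths along.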
So it was definitely important to introduce this extra layer, or service, into our architecture, since the convenience and locality we've been stressing matter a lot at this scale, especially when you're talking about terabytes or, theoretically, petabytes of data. Growing out of that same idea of trying to scale this solution when you're dealing with terabytes, it helps to have a centralized storage solution. If you think about it, if everyone's trying to access the same thing, why should each of them have to copy the whole thing in its entirety before they do their work? This is especially useful when you compare it to the more traditional approach with HDFS, where the user is responsible for copying all the data into their compute environment first. With that approach, the user will probably want to persist their compute environment between job executions, which is a big burden on the user and will probably end up costing them money. With a centralized storage solution, the user can instead take advantage of transient data processing clusters. And if you're worried about the performance of running big data applications that access Swift, there are definitely performance improvements to be had; if you stick around for the talk that immediately follows this one, you can learn more about some of those improvements.

There's still something missing from this architecture, though. When designing this system, we needed something to tie everything together, something that could offer tight, specific integration with the surrounding services, for example guiding users through job submission with Sahara. For the workflow we were trying to introduce, everything was a little too manual. So we introduced a new user interface, which we call GG, designed with simplicity as the primary goal; it essentially offers single-click interoperability between the Dataverse UI and Horizon. As an example of what we're talking about: on the left you can see our process for launching a data processing cluster with Sahara, where we preserve only the options most important to the user, and on the right you can see what the upstream Sahara dashboard looks like. We're not trying to reject or replace the Sahara dashboard, but it's an upstream UI trying to cater to every possible use case, while in our system we're trying to emphasize simplicity and convenience, so the Sahara dashboard wasn't satisfactory for our use case. It's worth noting that this final component, GG, is an entirely optional piece of the Cloud Dataverse ecosystem, and it's certainly not the only possible solution for this use case; you could imagine it being done with either some other new UI or with a modified form of Horizon.

Right, so at this point we'll go through a quick demo of everything we've discussed, in action. This is the Dataverse UI, and from here you can browse datasets. If you click on one, you can see the files in that dataset, along with citation and other publication info and the description. You can see right off the bat that we provide the container name, so if you have the ability to access it directly, you can go right into it. Otherwise, the compute button will redirect you to GG and help you get set up with Sahara.
And then if you click on one of the files, again you can see the compute button, and if you scroll down you'll see we preserve the original functionality from the original Dataverse project, which is direct downloads; you can still do that with this new project as well. Finally, if you do click the compute button, you're redirected to GG. From here you can immediately launch a new cluster, or if you already have one you can just use that. If I click on it, you can see that, like I was mentioning, we've only preserved the most important options: which software you'd like to run on the cluster and the size of the cluster. You do have to provide a name, and that's really it. So you can see that the cluster has been launched, and if you switch over to Horizon you can see the cluster spawning. If you switch back to GG, you're now ready to launch a job. You can see that the name of the container holding your selected dataset has been pre-populated after you clicked the compute button. Now, to run the job, all you have to do is say which cluster you'd like to run it on and which type of job. In this example we're using word count, but eventually users will have the option to upload their own job templates. Finally, you do have to provide the input and output: if you put a wildcard there, it indicates that you want to analyze every file in the dataset, and then you provide the container where you want to see the output of your job. So you can see the job has been successfully submitted, and if we go back to Horizon, the job is first pending and then switches to running pretty quickly. You can observe the status of the job from the Spark master UI: if I refresh, the application is now listed under running applications. It's a pretty small dataset, so pretty soon you can see the output in the container you specified, the job completes successfully, and you can download your results.

Some future considerations for this project. We'd like to implement a metadata system so you can know which files are actually computable; we've observed that a lot of people who publish datasets on a platform like Dataverse include things that aren't designed to be consumed by the compute engine, like a PDF report or even just sample outputs of previous analytic work. We're also looking at better container permissions; part of the goal, especially if you came to the talk on Monday, is that we're trying to move beyond just simple public datasets, since not every researcher wants totally unrestricted access to what they're working on. We're also looking toward a common identity provider between Dataverse and OpenStack, so your Dataverse account can be your OpenStack account, which will make everything more seamless and help tie everything together. And we're investigating the ability to share your job binaries or other analytic scripts paired with a dataset directly, so that, for example, if you wanted to replicate the results of a paper, you could jump right into that in your compute environment immediately.
And on that same note, we're looking into providing a workflow similar to the one we have for big data that works for smaller datasets; that was the idea of the smaller VM running something like R or scikit-learn. Finally, we have the somewhat more novel idea of a shopping cart for when you're browsing datasets: you can imagine that with a very popular Dataverse with lots of published datasets, you could choose to analyze content from multiple datasets together, so that's something we're thinking about for the future. At this point, if you have any questions, feel free to come up and use the microphone; we're happy to answer anything.

That was really neat, thanks. I really liked how simple it was for you to integrate Swift into the existing work that you had. What I was curious about is, I noticed you were doing that inside of Java, so what Swift client were you using for that? What library? Client library?

The standard Swift client library I mentioned; I can go back to it. The library is called JOSS. A lot of people have been using it with pretty good success. JavaSwift and JOSS are used interchangeably, and I'll show you the actual name of it.

Oh, sorry, I missed that part right there. Great, thanks. And when you were running the job before, what language was that in? Was it SQL? Was it Java? Was it pre-written?

Oh, sorry, that was a Spark application written in Scala.

Any more questions? Anything else I can clarify? We have plenty of time. Okay. Thank you, guys. Thank you.