Hello everybody. Hi there. My name is Ata Turk. I'm a research scientist at Boston University and the Massachusetts Open Cloud, and today we will introduce the big data as a service solution that we are building and implementing in the Massachusetts Open Cloud. This deck of slides and this work are a collaboration between multiple entities, as you can see at the bottom, and a number of presenters prepared the slides, but I will go over most of them to keep things brief for you.

We are literally living in the era of big data analytics, machine learning, and artificial intelligence. We are generating enormous amounts of data, not only human-generated data but also machine-generated data, and enterprises and organizations are recognizing the value of analyzing this data to improve their performance, efficiency, and services, to provide more intelligent services to their users, and to gain a competitive edge. The cloud is a great platform for these environments because it provides better cost, better scale, better availability, and easier management. So we see a pattern of big data analytics platforms moving to the cloud, and the rise of public cloud offerings such as AWS Elastic MapReduce, Amazon Public Datasets, and Azure HDInsight. These platforms provide such services and are getting more and more popular. In this talk we will explain our efforts to provide such a service in our own cloud, the MOC, and this service is running on top of OpenStack; obviously, otherwise we wouldn't be here. We had to build some components on top of OpenStack to be able to provide these services: we had to develop a cloud data set repository, we had to build user-friendly UIs that ease the setup of big data frameworks, and we had to develop a data-center-scale caching solution so that our users can access the data sets quickly and can set up these environments very fast. I will go over these in this talk.

Briefly about MOC. The MOC is a collaboration between academia, industry, and government. It is a very unique organization in this sense, and we are trying to build a public cloud on top of the open cloud exchange model, which is a novel model. If you listened to the talks on the first day, on Monday, we covered the MOC's essentials. The academic partners of MOC include Boston University, Harvard, Northeastern, MIT, and UMass. Our core industry partners include Intel, Red Hat, Lenovo, Brocade, Two Sigma, and Cisco. From the government side, the US Air Force and the Commonwealth of Massachusetts provide our funding. We are currently offering IaaS services on top of OpenStack. We have a 15-megawatt data center in Western Massachusetts, in Holyoke. This data center is owned and operated by those five research universities; it is a shared colocation facility owned by the five universities. One of the core mandates of MOC is offering big data analytics, machine learning, and artificial intelligence services, that is, offering services that enable big data analytics, machine learning, and artificial intelligence research and innovation.

All right, so when we decided to build this big data as a service solution, we started talking with our potential users. Our initial potential users mostly come from our MOC partners, namely from academia, industry, and government.
And we tried to decide on the set of features that we wanted to support in our solution. Initially, we started with a design similar to Amazon Public Datasets and AWS Elastic MapReduce, where we have a centralized data repository that hosts the data sets and a computational platform that supports on-demand big data cluster setup. This was our starting point. But as we kept talking with our potential users, we realized that we needed more features than that.

One requirement from our users was being able to easily search data sets. AWS Public Datasets just lists the data sets; we wanted advanced search facilities. We also wanted mechanisms to incentivize sharing and collaboration on these data sets; in particular, researchers wanted to be able to get citations for the data sets they upload. And we wanted support not only for public data sets but also for community-owned data sets. There are certain researchers and state organizations that want to upload their data sets but don't want to share them with everybody. For example, researchers at CERN want to share data with other physicists, but only the physicists collaborating within the CERN project. So we wanted support for community data sets.

After investigating the usage patterns of these data sets, we realized that some of them have an unusual usage pattern. Practically most of the data sets we are trying to host are periodically updated, like weather data sets, traffic data sets coming from the state, or social network data sets. They are collected every day, every hour, or every week, and the way analysts use them is skewed: the more recent the data set, the heavier the usage. If we used default storage mechanisms, we would observe a lot of load imbalance and network imbalance. So we realized we had to provide a caching feature within our data center, and we started building a caching mechanism for it. Finally, since MOC is colocated in a shared data center, we had a requirement imposed by our own infrastructure: we had to have mechanisms for pooling and sharing the hardware resources provided by the multiple entities that make up the MOC.

I will try to visualize this set of requirements in an image so you can understand it better. We started with OpenStack as our computational platform and Ceph as our centralized data lake. For on-demand big data cluster setup, we quickly settled on Sahara, because Sahara provides the functionality we need, even though the UIs of Sahara are not as good as we would like. We had to make some changes there, but we still like the functionality Sahara provides. So this is what we started with. On top of this, we had to build the data set repository solution, for incentivizing data set sharing, for making data set search easier, and for supporting not only public data sets but community data sets as well. We also had to build the data set caching solution, so we could provide faster access to commonly used data sets and so that on-demand clusters can be set up faster.
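As an aside in these notes, here is a minimal sketch of what requesting such an on-demand cluster through Sahara's REST API could look like. The endpoint URL, project ID, token, template ID, and image ID are all placeholders; the field names follow Sahara's v1.1 cluster-creation API but may differ depending on the Sahara version deployed, so treat this as an illustration rather than a recipe.

```python
# Sketch: ask Sahara for an on-demand big data cluster via its REST API.
# All identifiers below (URL, project, token, UUIDs) are hypothetical placeholders.
import requests

SAHARA_URL = "https://moc.example.org:8386/v1.1/<project-id>"   # hypothetical Sahara endpoint
TOKEN = "<keystone-token>"                                       # obtained from Keystone beforehand

cluster_request = {
    "name": "demo-spark-cluster",
    "plugin_name": "vanilla",                  # Sahara provisioning plugin
    "hadoop_version": "2.7.1",                 # version supported by that plugin
    "cluster_template_id": "<template-uuid>",  # pre-built template hides node-group details
    "default_image_id": "<glance-image-uuid>", # image registered with Sahara
}

resp = requests.post(
    f"{SAHARA_URL}/clusters",
    json=cluster_request,
    headers={"X-Auth-Token": TOKEN},
)
resp.raise_for_status()
print("cluster id:", resp.json()["cluster"]["id"])
```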
And we also had to build the repurposing and caching-based on-demand solution due to the requirements of the MOC. I will first start with the data set repository solution. Actually, the previous talk discussed how this was implemented, but I will quickly go over it: Cloud Dataverse. First we investigated whether there was any data set repository solution in OpenStack, and we couldn't find any that catered to the specific needs we have, such as sharing incentives and support for community data sets. The commonly used approach of just providing public Swift or S3 endpoints did not work for us. So we needed a data set repository solution for the cloud. After looking at available open-source solutions, we settled on Dataverse, an open-source data set repository project, and decided to extend it so that it can work with OpenStack, which resulted in the Cloud Dataverse project.

Briefly, what is Dataverse? Dataverse is an open-source software platform for building data repositories. It provides incentives for sharing data sets, such as issuing a digital object identifier (DOI) for each data set, so when you upload a data set you can get credit for it in the form of citations. Dataverse also provides mechanisms for controlling who accesses which data set: when you upload a data set, you can determine which users can access it, so it inherently has community data set support. It also has a long-lasting community. The Dataverse project started 10 years ago within Harvard, but it is now deployed at more than 20 sites all around the world, and more than 500 institutions are using it. Especially in the academic community, Dataverse has achieved widespread acceptance; if you have a paper in Nature or Science, the chances that the data set you used lives in one of these Dataverse sites are very high. So it has a community and it is a long-lasting project, and that is also what we liked about it.

Within the Cloud Dataverse project, in collaboration with the Harvard Dataverse team, we decided to create a version of the software suited to the cloud. Dataverse was initially designed for smaller data sets and used a network file system for storage; we switched the storage back end to Ceph to have a more scalable repository with support for much larger data sets. Dataverse also did not have a replication mechanism, and we are currently building a harvesting solution for replicating data sets in the cloud. Of course, we also added a compute button next to each data set, so now you don't need to download data sets to compute over them. You can just click the compute button, and since the Dataverse storage and the compute services we provide are within the same data center, you don't need to transfer any data over the network; you can quickly start your analysis on the data. So we settled on Cloud Dataverse as our data set repository. It is still an ongoing project; we are still building it.
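To give a feel for the kind of data set search this enables, here is a small sketch using Dataverse's public search API. The server URL is a hypothetical stand-in for a Cloud Dataverse deployment; the response field names follow the upstream Dataverse search API but may vary by version.

```python
# Sketch: search a Dataverse instance for data sets matching a keyword.
# The server URL is hypothetical; the /api/search endpoint is part of upstream Dataverse.
import requests

DATAVERSE_URL = "https://dataverse.example.org"   # hypothetical Cloud Dataverse endpoint

resp = requests.get(
    f"{DATAVERSE_URL}/api/search",
    params={"q": "traffic", "type": "dataset", "per_page": 10},
)
resp.raise_for_status()

for item in resp.json()["data"]["items"]:
    # global_id is the data set's DOI, which is what gives uploaders citation credit
    print(item.get("global_id"), "-", item.get("name"))
```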
Next, I will go over, actually my colleague will talk about, our data set caching solution for fast on-demand access.

Hello, everyone. Now I'm going to talk about the caching solution that we developed. This actually started as a research project. We designed a caching mechanism, D3N, which is an SSD-based, block-level caching mechanism designed to speed up big data accesses to the data lake. When we designed it, we were inspired by today's content delivery networks, or CDNs. CDNs basically cache data on the access side of the network and use DNS to connect the client to the closest cache node. We adopt a similar approach here. Our D3N solution caches frequently used data sets to prevent bottlenecks that occur in the network, because today not every data center has full bisection bandwidth; there is a lot of over-subscription, and we see different bottlenecks at different levels of the network hierarchy. We also use DNS-like lookup servers. We distribute those servers across our data center, and they forward requests to the closest or least-loaded caches. This way our VMs do not have to be aware of the data center network topology or the state of the caches. Also, as I said, we cache data at block-level granularity to benefit various applications; you can think of Spark, for example.

Okay, in this picture you see a typical data center architecture. We have the racks at the bottom, which contain our compute nodes running big data applications, and on top you see a hierarchical network topology. As we move up this hierarchy, we usually see more over-subscription and more congested links. So D3N adds the following components to this architecture. First, we add anycast lookup servers on each rack; as I said before, these servers direct requests from VMs to the closest cache nodes. The second component we add is the cache servers, or cache nodes. We install one of these servers on each rack, and they are equipped with high-speed SSDs. We then implement multiple levels of cache on top of these cache servers.

Now I'm going to talk about each of these cache levels. First we have the L1 caches. These are implemented per cache server, and the goal of the L1 cache is to reduce accesses to the top-of-rack switches. Then we have L2 cache pools; you can think of these as a distributed cache implemented within the cluster, and their goal is to reduce the traffic generated toward the cluster switches. Of course, as you move up the hierarchy you can add more layers of caching, but in practice you usually don't need a cache for each level; two or three levels of cache will probably eliminate most of the bottlenecks you see in your data center. And since all of these layers are implemented within the same cache servers, we are very flexible in changing their sizes based on need.

So this was our design, and we implemented it in the Massachusetts Open Cloud. In our implementation we only built the first-level and second-level caches, because that is what we need in our architecture. I'll briefly explain what we have done. We implemented our caching solution within the RADOS Gateway, and we only added or modified about 2,500 lines of code. When we forward a request from the first-level cache to the second-level cache, for example when you have a miss on the first-level cache and have to forward the request to the second level, we use a consistent hashing algorithm to compute the location of each block.
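To make that routing step concrete, here is a minimal sketch of consistent hashing over the L2 cache servers. This is an illustration in Python, not the actual in-RGW C++ implementation; the node names and virtual-node count are arbitrary.

```python
# Sketch: map a fixed-size object block onto an L2 cache node with a consistent hash ring.
import bisect
import hashlib

class HashRing:
    def __init__(self, cache_nodes, vnodes=100):
        self.ring = []                              # sorted list of (hash, node)
        for node in cache_nodes:
            for i in range(vnodes):                 # virtual nodes smooth out the distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)   # first node clockwise of the key
        return self.ring[idx][1]

ring = HashRing(["cache-rack01", "cache-rack02", "cache-rack03"])
# Each block of an object hashes independently, spreading blocks across the L2 pool.
print(ring.node_for("dataset-2017-11-06.csv:block-0"))
print(ring.node_for("dataset-2017-11-06.csv:block-1"))
```

A nice property of this scheme is that adding or removing one cache server only remaps the blocks adjacent to it on the ring, rather than reshuffling the whole cache.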
As I mentioned previously, our solution forwards requests from clients to the caches. Now I'm going to show you some very exciting results. We ran a simple curl benchmark to see the performance of our cache implementation. On the x-axis, we scale the number of concurrent client nodes making the requests, and on the y-axis you see the aggregate throughput. The light purple bar, labeled RGW, is the unmodified, original RADOS Gateway code; in this scenario, our virtual machines read the data from the data lake. The dark purple bar represents D3N L1 hit performance; here, our virtual machines read the data from the rack-local L1 cache. As you can see, D3N can provide up to a 5x performance improvement compared to the vanilla RADOS Gateway, and we can saturate the read speed of the SSDs. A rough sketch of this kind of concurrent-read benchmark is included after the hardware overview below. Currently, we are upstreaming D3N as an experimental feature into Ceph, we are deploying this caching architecture in the Massachusetts Open Cloud, and we are evaluating D3N under a variety of workloads. So please talk to us if you have any example data sets we can test with.

All right, I will talk briefly about the existing hardware infrastructure we have in MOC for serving the big data as a service solution. We are sharing the compute resources of the Engage1 HPC cluster. This is a production HPC cluster deployed in the shared data center I mentioned, the MGHPCC, and it hosts around 300 servers distributed across 18 racks. We extended this cluster by adding 10-gigabit NICs to each server, adding a top-of-rack switch to each rack, and adding a cache server to each rack, as mentioned earlier, with three-terabyte Intel SSDs and 40-gigabit uplinks. We connected these top-of-rack switches in a bifurcated ring topology. The bisection bandwidth of our network is around one terabit per second, and between any two racks we have three distinct paths. Our storage solution is based on Ceph, as I mentioned. The current data lake deployment runs on top of 90 spindles assigned as 90 OSDs, our current storage capacity is around 326 terabytes, and it is served by 10 servers distributed within the network.

So, D3N is our data set caching solution; this slide is misplaced. This is a brief picture of what our physical hardware architecture looks like. As I mentioned, our storage mechanism is Ceph. We have 326 terabytes of capacity as of now, but we are trying to expand to 20 petabytes. We have one additional rack, in addition to the HPC cluster, that serves our static OpenStack deployment. There we are running OpenStack, we are running the Cloud Dataverse repository, and we are using Sahara to spin up on-demand clusters. As the demand on this OpenStack deployment increases, we expand and borrow resources, borrow servers, from the production HPC cluster.
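Referring back to the read benchmark described above, here is a rough sketch of that kind of concurrent-read measurement against the RADOS Gateway's S3-compatible endpoint, written with boto3 rather than curl. The endpoint URL, credentials, bucket, and object names are placeholders; this only illustrates the measurement, not the actual benchmark harness used for the results.

```python
# Sketch: concurrent GETs against an S3-compatible RGW endpoint, reporting aggregate throughput.
# Endpoint, credentials, bucket, and keys are hypothetical placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.moc.example.org:8080",   # hypothetical RADOS Gateway endpoint
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

def read_object(key):
    body = s3.get_object(Bucket="datasets", Key=key)["Body"].read()
    return len(body)

keys = [f"benchmark/object-{i}" for i in range(16)]   # objects assumed pre-loaded into the bucket

start = time.time()
with ThreadPoolExecutor(max_workers=len(keys)) as pool:
    total_bytes = sum(pool.map(read_object, keys))    # one concurrent reader per key
elapsed = time.time() - start

print(f"aggregate throughput: {total_bytes / elapsed / 1e6:.1f} MB/s")
```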
So I will briefly explain how we do this repurposing scaling operation. As I mentioned, all of MOC runs on top of MGHPCC, the shared data center. This data center is owned by those five research universities, and the hardware resources available to MOC are actually meager. We normally run on top of a few hundred servers, but we have agreements with all of these groups, and as the demand on our cloud increases, we can borrow hardware resources from them and expand into these other clusters. But when we don't have demand, we have to give the resources back. For us, what is critical is being able to borrow these resources and give them back without creating any disruption in our environment or in theirs, and also being able to do this elastic borrowing and returning of nodes as fast as possible, so that we do not reduce the utilization of the overall data center.

To achieve this, we developed two layers of services. The first service we developed is the Hardware Isolation Layer, HIL. This is a shim layer that operates only at the network layer and provides us with elastic resource allocation and network isolation. We built this service to be compatible with any provisioning solution: it just provides isolation, and for provisioning you can use any tool you have. If you have an HPC environment, you can use your own provisioning tool; if you have an OpenStack environment, or if you are different groups running different HPC environments that prefer different provisioning solutions, each of you can use your own. Still, to be able to provision very quickly from one environment to the other, we also developed another provisioning solution, which we call the Bare Metal Imaging system, or BMI. BMI does diskless provisioning and network booting from pre-installed images. Through this it can provide rapid provisioning, and it brings the benefits of the image management available to virtual machine systems to bare-metal systems. That is very beneficial to us: we take an environment, use it, and give it back in the state in which we took it, which is very beneficial to our collaborators as well. It also reduces the amount of reprovisioning time significantly. A standard provisioning tool such as Foreman requires around 25 minutes to provision an environment; with BMI, we can provision an environment in under six minutes. So our repurposing scaling solution is based on a combination of HIL and BMI.
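To illustrate how the two layers divide the work, here is a self-contained, purely illustrative sketch of the borrow/return cycle. The hil_* and bmi_* helpers below are hypothetical stubs, not the real HIL or BMI interfaces; the point is only that HIL moves a node between isolated networks while BMI network-boots it from a pre-installed image.

```python
# Sketch: the borrow/return flow between the HPC cluster and the MOC cloud.
# All helper functions are hypothetical stand-ins that just print the step they represent.
def hil_move_node(node, network):
    print(f"HIL: attach {node} to {network} (isolated at the switch level)")

def bmi_boot(node, image):
    print(f"BMI: network-boot {node} from pre-installed image '{image}'")

def bmi_release(node):
    print(f"BMI: detach image from {node}, leaving its local disks untouched")

def borrow_from_hpc(node):
    hil_move_node(node, "moc-bdaas-network")       # isolate the node on the MOC network
    bmi_boot(node, "bdaas-worker")                  # boots in minutes instead of a full reinstall

def return_to_hpc(node):
    bmi_release(node)
    hil_move_node(node, "hpc-production-network")   # hand the node back in its original state

borrow_from_hpc("node-rack07-12")
return_to_hpc("node-rack07-12")
```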
So, to build this big data as a service solution, we had to develop a number of components and make a number of changes on top of OpenStack. I will quickly go over the set of requirements we had, the changes we made, and the takeaways we want to give to the OpenStack community. We had to build a data set repository solution, and we settled on Cloud Dataverse. Our takeaway is that Cloud Dataverse currently runs as an application on top of OpenStack, and it would be great if OpenStack internalized Cloud Dataverse, because OpenStack needs a data set repository solution; we cannot be the only team that needs a repository with the features Cloud Dataverse offers. We developed a data-center-scale caching solution, D3N, and we are working on upstreaming it into Ceph, so it should become available to researchers. But our takeaway for the OpenStack community is that if OpenStack could integrate with such caching solutions, it could offer more intelligent decision-making capabilities to its users. For example, if OpenStack is aware of the data placement within the caches, it can make better VM placement decisions or intelligent data routing decisions.

Finally, we had to develop a number of user-friendly UIs. The previous talk discussed this, and on Monday we talked about the DG project as well. Unfortunately, the current UI for Sahara is very complex, especially in the cluster setup and initiation phase, and most of our early users, I would say, are complaining about it. We definitely needed a cleaner, few-click UI, and we had to develop it ourselves. So our takeaway is that if the Sahara UI team could talk with our team to internalize those solutions, or if we could come to a general consensus on being able to do this in a few clicks, it would be useful.

All right, to sum up, we developed a big data as a service solution at MOC. We provide a data set repository to host very large data sets, mechanisms to collaborate on these data sets, a computational platform to analyze them, and caching mechanisms so you don't need to hold on to your big data clusters: you can give them away, come back to them later, and still access your data very fast. The whole big data as a service project was built as a collaboration between the multiple entities and groups partnering within the MOC. Our networking solution initially came from Brocade and we innovated on it a lot. Our storage and caching solution initially came from Lenovo and Intel, and the design for the CDN-like caching mechanism initially came from Intel, even though the implementation was done in academia, at Boston University and Northeastern, and the deployment was done by MOC engineers. The whole effort was a collaboration between multiple entities, and we believe this is a good example of how the open cloud model that MOC stands for can benefit research and innovation in the cloud space, because industry, academia, and government all collaborated within this project to provide this end-to-end solution. All right, thank you. If you have questions, we can take them now.

Yeah, sure. I noticed that you talked a little bit about your high-performance computing, and that you have a lot of cluster networks. Are you using dark fiber or Internet2-like HPC networks? And if you are, is that beneficial to the solution you have? If not, are you using standard point-of-presence circuits that you get from your current carriers?

Right. As of now, we stay within the data center, and even though the HPC system has its own network, we are not using it; we developed our own network within the cluster so as not to disturb their communications. Any more questions?

Yeah. What kind of caching algorithm do you use? You know, LRU and so on. And can I configure it?

Right now we are using LRU, but for the future we have some ideas, like segmented LRU; those will be coming soon, but currently it's LRU. And we are using the default block size, which is four megabytes; it's a configurable parameter for the RADOS Gateway.
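To make the eviction policy from that answer concrete, here is a minimal, illustrative sketch of a block-level LRU cache in Python: 4 MB blocks, least-recently-used eviction. This is not the D3N code, which lives inside the RADOS Gateway in C++; the names and the tiny capacity are arbitrary, and a segmented-LRU variant would slot in by changing the eviction rule.

```python
# Sketch: block-level LRU cache with least-recently-used eviction.
from collections import OrderedDict

BLOCK_SIZE = 4 * 1024 * 1024            # default block granularity mentioned above: 4 MB

class LRUBlockCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()     # block_id -> data, ordered by recency of use

    def get(self, block_id):
        if block_id not in self.blocks:
            return None                 # miss: caller would forward to L2 or the data lake
        self.blocks.move_to_end(block_id)       # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block

cache = LRUBlockCache(capacity_blocks=2)
cache.put("obj:block-0", b"...")
cache.put("obj:block-1", b"...")
cache.get("obj:block-0")                # touch block-0 so it becomes most recently used
cache.put("obj:block-2", b"...")        # evicts block-1, now the least recently used entry
```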
If you are interested in the caching solution, Dataverse, or the other tools, we have recently made academic submissions for both, so we have more detailed technical documents that we can share; just reach out to us and we can share them with you. Also, if you have data sets or workloads, or if you want to try our caching solution, right now we are only exposing this architecture to a set of trusted users, I would say. It is still in beta and not open to the public yet, but reach out to us and we can definitely expose it to you, and we can try to collaborate and improve it. All of the projects I mentioned, Cloud Dataverse, D3N, and the DG project, are open source and have their own GitHub repositories, as are HIL and BMI. If you look at the Mass Open Cloud website, you will find the links. Please reach out to us; we would like to get more collaborators. All right?