Good morning, everyone. Hi, my name is Joe Arnold, and I'm the CEO of SwiftStack. I'm joined here by Didier Contis. Good morning, everyone. And you're all in for a treat. We're going to walk through a use case of what Georgia Tech is doing with Swift. So again, I'm Joe Arnold, CEO at SwiftStack. What we do at SwiftStack is build OpenStack Swift. We have a lot of developers who are on the core team of Swift, and we've built a product around Swift. We help enable our customers to deploy and scale their OpenStack Swift environments.

And I'm Didier Contis. I'm the director of IT for the College of Engineering at Georgia Tech. And as you know, Georgia Tech is the university a few miles from here.

All right, OpenStack and Swift. What Swift is is an object storage system. What that means is that you put data in and out of the system via an object API, and in Swift's case, that means it's over HTTP. So you GET, you POST, you PUT data into it. It's not a file system, and that's the way you interface with it. The benefit of an object storage system is that there are some properties of scale-out: there's multi-tenancy built in, it's massively concurrent, it runs on Linux in Swift's case, and it can be powered by inexpensive, standards-based hardware. The origin of object storage, at least in its current iteration, really came from the first public clouds. Amazon introduced a product called S3, if you're familiar with that world. What Swift is is an open source project where you can run your own version of an object store in your own data center. It came from Rackspace; they built it to power their competitor to S3, called Rackspace Cloud Files, and then a few years later they released it into open source as part of OpenStack. And here we are today, and that project has continued to power lots of public clouds.

How Swift does data placement is it distributes data, as we call it, as unique as possible. So whether that's a small environment or a large environment, data gets placed as far apart as possible, which makes it really tolerant of failures. And that's exactly what we're going to walk through with this use case, how that gets used. So the objects are distributed around, and you get access to them via an object API, which is over HTTP. But in this context, in a lot of the use case that we're talking about here, there is also an application which speaks CIFS and NFS, so that existing workflows, existing applications have a way to get access to that data through those protocols.

OK, thanks, Joe. So Georgia Tech, we are a university located a few miles from here. We have a lot of students in engineering, and we do cool stuff. Before OpenStack, before the public, private, hybrid cloud, there were some things that we called our federated condominium systems. Here's an example. This is our farm oriented toward VDI and applications, where research groups and IT departments can come and bring servers, and in exchange we have a platform as a service where our IT groups can publish applications to our students as part of virtual labs. This is another farm that we've been using around HPC, where faculty research groups can bring their servers, and in exchange these servers get integrated into the HPC infrastructure and our faculty can run large-scale applications. So these are kind of our pre-cloud environments. Our faculty and research groups love to compute things, love to do simulations.
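As a rough sketch of what "you GET, you POST, you PUT data into it" can look like against a Swift cluster, here is a minimal example using Python's requests library; the storage URL, auth token, container, and object names are illustrative placeholders, not anything from Georgia Tech's deployment.

```python
import requests

# Placeholder account endpoint and token; in practice these come from the auth service.
storage_url = "https://swift.example.edu/v1/AUTH_research"
headers = {"X-Auth-Token": "AUTH_tk_placeholder"}

# PUT on a container path creates the container (a flat namespace for objects).
requests.put(f"{storage_url}/demo-container", headers=headers)

# PUT the bytes of a file as an object; the body is simply the file contents.
with open("results.csv", "rb") as f:
    requests.put(f"{storage_url}/demo-container/results.csv",
                 headers=headers, data=f)

# GET the object back over plain HTTP; no file system mount is involved.
resp = requests.get(f"{storage_url}/demo-container/results.csv", headers=headers)
print(resp.status_code, len(resp.content))

# POST updates metadata on an existing object.
requests.post(f"{storage_url}/demo-container/results.csv",
              headers={**headers, "X-Object-Meta-Project": "demo"})
```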
But you know what? They love their research data even more. And it's not only the users of our VDI farm or our HPC farms, it's all our faculty. They love to acquire, to create, to exchange, to receive research data. Have you ever seen a grad student on a Friday afternoon receiving a box from FedEx or UPS full of USB drives with large data sets? They're really excited. You know, some of our faculty do fun stuff. For example, we have this van driving across the highways in Atlanta with a bunch of sensors. It looks like some kind of Google car. What they do is record data about the condition of the road, of the pavement, of the bridges. And they record a lot of data. For every mile they drive, they acquire 2.2 gigabytes of data. Once they come back home, upload the data, and start to do the post-processing, they generate an additional 1.2 gigabytes of data per mile. So far, they have 16 million files on a simple white box server. And this is not very efficient.

Here's another example where, instead of creating data, we acquire data. In Georgia, we had a project on Interstate 85 to convert an HOV lane, a high-occupancy vehicle lane, into a high-occupancy toll lane, to try to relieve the congestion on I-85. As part of this conversion, we have research groups trying to study the impact. What they do is they have access to a direct fiber network feed from the Georgia Department of Transportation, and they have access to all of the cameras recording the videos. So far, they have recorded 400 terabytes of data. And guess what? All of the file servers they thought they would be able to use and have enough storage on, they are full. They have to start using USB drives left and right to make space. They have to try to scavenge free space wherever they can.

In addition to acquiring, receiving, and exchanging research data, we have another small issue. Recently, there has been a mandate issued by the White House regarding data curation. That means that once a research project is done, some of the research data generated or created has to be curated after the project is done, so that other researchers across the country can access this data and do research with it. That means that once the funding of the project is done, we have to figure out where we're going to be storing this data for the long term. We're talking about 25, 30, 50 years.

So, you told us you have all of this research data. Where do you store it? Well, first of all, we don't know how much research data we have. We know we have a lot. Just for HPC, we have two petabytes of storage. That's a lot, but we probably have more petabytes. In addition, we don't know where to store it using the gold-plated enterprise storage the typical storage vendors are trying to sell us. Yes, if I had an infinite amount of money, I could basically buy the scalable, highly available NAS system. We can't afford it. Backup is becoming a problem. Guess what? When you have a cheap white box server with 16 million files, it takes a lot of time to back up those 16 million files, especially if you have just a bunch of disks. But more importantly, our single biggest challenge when it comes down to storing research data is the USB drive. This is the new storage on demand. You run out of storage, you just go purchase an additional USB drive. Is that efficient? No, but it's our version of storage on demand. Here's an example.
The story says that sometimes important research data might be stored on not-so-reliable storage systems. There is an urban legend at Tech about an unconfirmed report of a cheap NFS file server that may have been in action back in 2006. A research group reportedly bought a cheap, refurbished desktop system, added a bunch of USB ports, attached 13-plus USB drives, and exported each USB drive as an individual NFS mount point for a research project. Now, what kind of redundancy do you have with this? Are you going to be running a RAID array on top of USB drives? Backup? Could you repeat the question, please? Anyway, you get the point. We have some challenges when it comes to research data. And we also have challenges when it comes to compute.

Our magic answer to all our problems: meet Vapor, the hybrid cloud. What Vapor is is our vision of bringing together all of the various projects we have. We have people who are already investigating cloud computing and OpenStack. We have all of these large farms, whether it be an HPC farm or a hypervisor farm for VDI, with a lot of unused horsepower. And so we're starting a project where various academic departments at Tech are coming together and saying, we are going to federate all our farms, all our projects, and we're going to be probably using OpenStack as one of the federation technologies. And this cloud, both for compute and for storage, is basically going to be in support of instruction and research. It's going to be distributed and federated, distributed across the campus, but also probably distributed beyond. We know we're going to have to integrate public providers like AWS, Azure, Rackspace into our hybrid cloud. One more thing: we need to be able to iterate very quickly. We have to be able to go fast, fast, fast, like someone said during the keynote, to adopt new technology.

What are we going to be doing with this hybrid cloud? Well, you have your typical workloads. We're going to have ephemeral computing, and a lot of pet computing; we have a lot of graduate students still needing some kind of dedicated, permanent, personal workstation. What a hybrid cloud will let us do is scale up in terms of memory and CPUs. There's self-service, if you need some small compute which doesn't require the high-speed interconnect of HPC, and platform as a service: OK, I just need a bunch of MySQL servers, or Hadoop clusters, to do some research. This is the overall vision. I'm not going to go into details for each layer. Yes, there are no OpenStack components listed. This is on purpose. This is a diagram we use when we brainstorm on what technology we're going to be using.

For this talk, I'm going to focus specifically on the data storage layer. This is where we have invested a lot of time. The purpose of this data storage layer is to store a large portion of our research data. And frankly speaking, this is a SwiftStack talk, but we know there will probably be a need to use additional technologies, additional vendors. Some of the requirements: it has to be distributed and resilient. Why? Well, sometimes we have a building wing offline during a weekend for electrical maintenance. Sometimes we have a fiber optic cable ruptured by a contractor, like happened a few weeks ago. We want to limit vendor dependency. I'm sorry, we love our storage vendors, but sometimes the relationship with a storage vendor goes south, and we don't want to be left with a bunch of data we cannot use. By going open source, we feel a little bit safer.
We want to be able to leverage de facto standards, S3 and Swift, with multiple entry points. Object is great, but we want to be able to access the data via other means as well. And we want to have a flexible design. If we are going to be using this data storage layer, we don't want to have to migrate the data in three, four, or five years. I cannot tell our faculty, hey, sorry, we're closed for business for 12 months, we have to migrate three to five petabytes of data to another storage platform. Not going to happen. So we have to have a storage layer which is going to be there for the long term. In summary, this data storage layer is going to be supporting several services: research data for active projects, long-term storage of the data for data curation, and data repositories if we want to share data with other institutions.

OK, so now the good part. Why Swift? Why did we go initially with Swift and SwiftStack? Well, Swift is open source, so that obviously limits the vendor lock-in. It's a turnkey approach when we use SwiftStack. I think it took us 10, 15 minutes to get the first version of the cluster going. So a very quick deployment. We like the fact that there is a lot of activity around Swift, and so there is a growing ecosystem. We're starting to see more and more middleware and applications able to support Swift natively as a back end. This is great. The system is robust. We love the fact that we can rely on replication. I cannot tell you how many times we've been burned by a RAID array going AWOL, and how much data we've lost and had to restore from tape. So, pretty solid so far. What don't we like? Well, it is object storage, meaning it's not easy to use when you're used to using file systems. There is an uphill battle in terms of adoption. And it's a fairly young project and product, so there is a bit of risk for us.

Some research projects are starting to adopt Swift. We've talked about a couple of projects already; obviously, we are targeting the transportation-based research projects for adoption. We also have a lot of projects going on in aerospace engineering and biomedical engineering that we are targeting. How do we approach a research group? We say, hey, OK, you've run out of space on your file servers, I have this wonderful product and service, it's called Swift, it's going to solve all your problems. And the researcher says, OK, yes. Give me a drive letter. Give me an NFS share we can map. What we're doing is we're trying to tell the research groups, look, you have a lot of research needs, but for the purpose of the research, it would be really great if you learned how to talk to Swift using the API. Because while you have an upfront investment in terms of time, you'll be able to benefit down the road in terms of indexing, metadata, scalability, large-scale performance, being able to run ZeroVM, being able to run Hadoop. And so we spend a lot of time talking to our research groups, trying to educate them. We even give them a free incentive: hey, if you use Swift, we'll give you 10, 20 terabytes as a freebie for you to start using the system. Some of them do, and some of them don't. And they say, I really, really, really need a way to talk to this wonderful system using CIFS or NFS. Why? We said it already. Object storage is difficult. But there is also a lot of workflow which is based on files. Our students are all the time using Linux applications and Windows applications to run post-processing jobs, to run simulations.
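To make that "upfront investment in terms of time" concrete, here is a hedged sketch of what storing an object with its own metadata through the Swift API might look like using the python-swiftclient library; the auth URL, credentials, container name, and metadata keys are made-up placeholders, not Georgia Tech's actual setup.

```python
from swiftclient.client import Connection

# All endpoints, credentials, and names below are placeholders.
conn = Connection(authurl="https://swift.example.edu/auth/v1.0",
                  user="researchgroup:grad_student",
                  key="not-a-real-password")

conn.put_container("i85-video")

# X-Object-Meta-* headers travel with the object, so later you can find data
# by camera or corridor without keeping a separate index of file paths.
with open("camera07_feed.mp4", "rb") as f:
    conn.put_object("i85-video", "camera07/2013-10-02.mp4",
                    contents=f,
                    headers={"X-Object-Meta-Camera": "07",
                             "X-Object-Meta-Corridor": "I-85"})

# Listing a container is a single API call, even with millions of objects behind it.
_, objects = conn.get_container("i85-video", prefix="camera07/")
for obj in objects:
    print(obj["name"], obj["bytes"])
```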
Latency and speed can also be an issue when we use a file system gateway. So in the case where the research group says, I need to be able to use this wonderful storage system via a CIFS/NFS access method, we are starting to deploy, basically, the SwiftStack gateway. We are also looking for a solution that will give us GPFS access. And we're thinking about other solutions: what about if we write the data to a high-speed NAS and use some kind of storage abstraction technology to move the data back to Swift?

And so the design goal that we had for the file system gateway was really to try to have no data lock-in. One of the constraints of the existing gateway products that present a file system and then back it with an object storage system is that the only way to get the data in and out is through that gateway. And that's because they'll take that file and chop it up in a proprietary way and then put that back into the object storage system. So the idea behind the file system gateway application that we've built is to allow data to come in via a file and then also come out via an object API. And that means that in this context, if a researcher is building an application to put data into it, they're often doing that via an application, so object APIs are the best way to put that data in. But then if they get to a workstation and they have a workflow, they can get that same object out of the system via a file system. So it just gives the best of both worlds, in a way, for accessing the data. The other issue sometimes with file system gateways is that they're built with a single public cloud account in mind. And when you go to a private cloud deployment, that's usually not the case. Every user on the system has an account, and you're keeping track, you're accounting, and you're granting access to users based on some other centralized authentication system. So we want to make sure that that also plugs back into that authentication service, so users can have their own shares and manage them separately from others.

Earlier I talked about this research data curation problem. We've talked a lot in this presentation so far about how we're going to be using, or how we're using, Swift to store active research data for active projects. But what is our strategy to also handle the research data curation problem, which is a different problem altogether? We have a project, led by the Georgia Tech Library, which basically is taking an existing repository, which is DSpace, and migrating this repository to a new technology, the Fedora repository infrastructure framework, with something else called Hydra, which is going to be used as the head, as the access mechanism by which you access the repository. I've talked with the lead developer for the project. They are deploying Fedora 4, for those interested, and it's Fedora 4 that will connect to Swift. In order to do this Fedora-to-Swift connection, they are using the JBoss ModeShape and the Infinispan storage subsystem, for those who are familiar with the technology. Initially, and that kind of goes to the ecosystem development, the Infinispan storage connection is going to be using the Swift S3 emulation layer in order to access the data. There is some work being done on testing whether or not they'll be able to use the Rackspace Cloud Files API to do the connection. So how have we implemented Swift across the Georgia Tech campus?
Well, remember, we're talking about a federated and distributed academic cloud. So we have a bunch of zones in the Swift cluster, but each zone is managed and owned by a different academic department. So in a sense, the School of Industrial Engineering is operating one zone, and the central IT services for the College of Engineering is operating another zone. So it's distributed management. It also means that no one owns the cloud, in a certain fashion. We expect to bring in additional zones. So far, we have three zones. We believe that as additional departments take an interest in this project and add resources, we'll grow probably to five or six zones. And on our roadmap is to try to leverage some of the, you know, not peering agreements, but more hosting agreements with some of our peer institutions, to be able to have an additional region located in a different state. And replication will take place using some of the Internet2 high-speed networking we have access to.

What hardware are we using? Well, like everyone else, we are using a bunch of white box servers. One thing, stated up front: it's not necessarily the greatest way to get the best performance, but we are leveraging the fact that Swift doesn't care that much about the hardware configuration. Swift is very tolerant of a heterogeneous configuration. So some zones have more storage nodes than others. Some zones are using primarily 24-bay chassis, some zones are using 12-bay chassis. We have a mix of enterprise and consumer grade. We always have this debate within our IT community: some departments prefer to use enterprise-grade drives, some departments prefer consumer-grade drives. All our storage nodes, at least 90% of them, are connected using 10 gigabit. We are using the LSI SAS adapter. One thing: if you are using this LSI 9211-8i SAS adapter, make sure you flash it with the IT, initiator target, firmware. Most of these cards ship with the integrated RAID firmware, and that's not necessarily the best for Swift's use case. A bunch of SSDs for the account and container rings, typically 60 gig to 120 gig configurations. Our biggest challenge on the hardware side is figuring out what memory-to-terabyte ratio we want to use. For maximum performance, the recommendation is to use 1 gigabyte of memory for each terabyte of disk space you have. That can become very pricey if you are using a configuration with 24 or 36 drives and using four-terabyte drives.

Our management approach is distributed. So again, each zone is managed by an individual team which shares administrative responsibility. One thing we are finding out is that there is a lack, at least with the SwiftStack controller, in terms of permission delegation. The granularity is not there yet. And that means that if we want to let an IT group be responsible for the management of the storage nodes, in this sense they have access to all of the storage nodes. There is not a way to restrict a specific group of people to only manage one specific set of storage nodes. So here's a request for the roadmap. Finally, our student workers are a great resource, and we use them to replace failed drives. This is great when they know which one to replace, or when we make sure that they don't pull a working drive. The nice thing is that before, a student goes to a storage box, pulls the wrong drive, and you now have a 24-hour rebuild time on a RAID array. Here, they pull the wrong drive, we have three replicas,
and the system will just rebalance and correct the error, so it's pretty good.

We make a lot of use of the integration between SwiftStack and LDAP. We need our users to be able to leverage their campus credentials. Initially, we were considering using Active Directory integration. We took a serious look at it and decided, no, we'd better wait. So we waited a few more months on the project until SwiftStack had a strong implementation of LDAP. The code is great. It took us literally five minutes to do the integration, just using the UI, with a service account for the binding, and here we go. What we appreciate is that overall, the LDAP integration seems to be fairly robust. We have seen in the past lots of products not being able to play nicely with a large-scale LDAP directory. Ours has around 300,000 accounts, 2 million entries, holding all of the people information because of all of the students that have enrolled at or applied to Tech. So far, so good. One thing: right now, access to the cluster is not restricted. It means that anyone, any student with valid LDAP credentials, can use our Swift cluster. So please don't go ahead and tweet this information to students. I don't want to find out, coming back to campus, that I've run out of space.

People typically ask us, OK, so what is the financial model? How do you plan to pay for this? Are you going to have a TCO, an ROI? Are you going to do chargeback? And the answer is no. I'm sorry, if you are expecting a bunch of financial data, you're not going to find it in this slide. Our number one goal is to ensure that a vast amount of research data reaches this Vapor hybrid cloud, reaches the data storage layer. So we have to keep it simple for our research groups. One way to do that is, first of all, to limit the recurring costs. Our research groups, because of the way research is done and funded, don't like recurring costs. So we're trying to absorb the recurring costs, like licensing and things like that, at a central level. And instead, we try to engage the research groups and the departments in funding the infrastructure. That means, for example, developing different models around bring-your-own. Bring your own zone: if a new school, a new department wants to take part in the project, bring your own rack in your own server rooms. Just more capacity for us. If a research group wants to buy a server full of drives and use it, fine, just let us do the ingest of the server. Bring your own drive. And so we're not going to have any kind of fancy financial model using, for example, Bitcoin or things like that. We're going to keep it simple. Rather than talking hard cash with our research groups, we're going to say, buy us drives. And so we see that, in the future, this project is going to be focused on using hard drives as a form of currency. I don't care how much our research groups are paying for the hard drives, I just want the drives. So research group A comes and says, I have four terabytes of data to store. Fine, buy me three four-terabyte drives. And it should be cheaper than buying three USB drives, by the way. OK, that sounds like a good idea. But drives have a finite life expectancy. What happens when these drives start to fail? How are you going to replace the drives? Well, we're kind of taking a bet here.
We're betting that once the drives start to die in three to five years, new drives will be bought by other research groups and will replace the failed drives, except that these new drives will probably be eight or 12 terabytes. So if there are any drive manufacturers in the audience, please ensure that in three to five years we have 12 TB helium drives or something equivalent.

OK, it's alive. We have a cluster. It's been running for at least a good year. What's next for us? Well, we plan to implement quotas. We cannot leave the system available for anyone to consume. It's like leaving free pizza after a conference on campus. People find the pizza and eat it, and it's gone, right? So we're going to be using quotas, probably container-based quotas, so we can assign quotas on a per-project basis. Lesson learned: we did a quick implementation. The implementation was very successful because of SwiftStack, and it's easy to deploy, but we have to go back and do a little bit of re-architecture. We have to strengthen the proxy layer so we can offer good performance to our end users. That's going to be our summer project. Whether or not we're going to be using proxies with dedicated load balancers is to be debated. Earlier, Joe talked about the SwiftStack file system gateway and its benefits. In our case, we know that we may have to use something else, like the high-speed NAS approach I mentioned, so that we can offer a higher level of speed to our end users. If anyone is interested in developing a GPFS gateway, that would be great. Why? Because we have a need with our high-performance computing farm to enable our students to use direct GPFS access to retrieve data. There is some interest around that. In our future, a lot of evangelization. We see that our number one task is going to be to go and talk to our end users and educate them on modifying their workflows so they can use the Swift API directly to store the data. We have to find a way to wean them off always using file system access. And one thing we are looking forward to is the development of the SwiftStack file system gateway, so that you can have unified access, whether you come via the gateway or via the object store, and be able to see the same data.

Thanks, Didier. So what we're working on at SwiftStack, fundamentally, is OpenStack Swift. We've been doing lots of contributions to the project. We have project technical leads, we have a lot of the core developers on the project, and we run a community test cluster so that new functionality has a place to be integrated and tested. And what we've built is a few things. We have a deployment model around how to get up and running, both from a hardware perspective and from a deployment perspective. We can integrate that, as Didier mentioned, with things like LDAP and other authentication systems, and with monitoring systems, so it can fit into that environment. And then we have scaling. We talk about adding new drives into the system, being able to do that in different parts of different data centers, and being able to fold that capacity in gracefully; there's a lot of orchestration involved in order for that to happen smoothly. So those are some of the things we do, and we support OpenStack Swift. I believe we have about eight minutes left if anyone has any questions. So thank you. Thank you.

Question on your zones. You mentioned you have heterogeneous equipment. Are the zones still all the same capacity size?
And if not, how do you address those issues? No, you're correct. Right now we have an imbalance, and I think some of the zones don't have the same size. I think one zone in particular is short on space. How do we address the issue? We actually rely on Swift to always apply the rule of keeping the data as far apart as possible. One of the things we're trying to do is, even if one zone doesn't have the same number of nodes, compensate for that by having bigger drives inside the storage nodes.

So in principle, how this works is, if you have zones that are uneven, say a tiny zone, a big zone, and another tiny zone, that big zone would maybe have more copies; we might have two copies of a given object in that particular zone, but placed on different equipment or different drives, depending on how the deployment is sized.

Would those copies in the bigger zone occur only after the smaller ones were saturated, or is the ring adjusted? So how it works is there's a partition space that gets mapped across the entire cluster. And so on average, they would fill up evenly by disk size, basically. So over the total capacity, on a percentage-full basis, that would be maintained across the different zones. So that's how it works. Thank you. Yeah, any other questions? All right, thank you very much. Thank you everyone.
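As a footnote to that last answer, here is a toy sketch, in Python and not Swift's actual ring code, of the idea that partitions are assigned to devices in proportion to their weight (typically disk size), which is why zones of different sizes still fill to roughly the same percentage; the zone layout and drive sizes below are made up.

```python
from collections import Counter
import random

# (zone, device, weight in TB); made-up sizes: two small zones and one big zone.
devices = [("z1", "d1", 4), ("z1", "d2", 4),
           ("z2", "d1", 12), ("z2", "d2", 12), ("z2", "d3", 12),
           ("z3", "d1", 4), ("z3", "d2", 4)]

total_weight = sum(w for _, _, w in devices)
partitions = 2 ** 14  # toy partition count

counts = Counter()
for _ in range(partitions):
    # Pick a device with probability proportional to its weight.
    r = random.uniform(0, total_weight)
    for zone, dev, w in devices:
        r -= w
        if r <= 0:
            counts[(zone, dev)] += 1
            break

# Each device ends up with roughly the same number of partitions per terabyte,
# so small and large zones fill to about the same percentage of capacity.
weight_of = {(z, d): w for z, d, w in devices}
for (zone, dev), n in sorted(counts.items()):
    w = weight_of[(zone, dev)]
    print(f"{zone}/{dev}: {n:5d} partitions on a {w} TB drive "
          f"(~{n / w:.0f} partitions per TB)")
```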