Thanks for coming to my talk about OpenStack Swift. My name is Christian Schwede; I'm a principal software engineer working at Red Hat. You might have expected someone else: Mahati, who was supposed to give this talk, wasn't able to come, but she asked me to step in. So here I am. I started working on storage-related systems about 10 or 15 years ago, first with DRBD, later with HPC computing on Lustre file systems and GPFS, and then OpenStack Swift, which I started with about four years ago. Right now I'm working on the upstream project with the core team at Red Hat. I'd like to start with a small introduction to Swift today: what Swift is about, where it comes from, and how to use it. Let's begin with the history of the name. Like many open source projects, Swift has an animal as its logo: an Alpine Swift. The Alpine Swift is a bird that stays up in the air for months, up to seven months, eating in the air, drinking in the air, and normally not coming down to earth at all. The same idea applies to OpenStack Swift, or at least that's what we aim for: even if parts of your cluster are down, you're still able to serve data to your users, you're still able to store new data, and you can even upgrade your cluster, group of nodes by group of nodes, while it's in production. OpenStack Swift has been running in production for more than six years now. It was started at Rackspace as the CloudFiles project, and in 2010 Rackspace, together with NASA, decided to launch the OpenStack project. That's when it changed its name from CloudFiles to OpenStack Swift. As of today, as many of you know, OpenStack is, besides the Linux kernel, one of the largest open source projects, with many thousands of developers across the globe and hundreds of companies involved. The same applies to Swift as a sub-project within the OpenStack ecosystem: we have contributors from basically all the major enterprise companies. Before going into detail with Swift itself, let's have a short look at object storage in general. What is object storage about? Traditional storage systems, for example file systems or block storage, are basically built to store structured data for you. Think of a database server: it needs some files in a nested directory structure, it normally needs a POSIX file system, and it needs some locking mechanisms to operate well. That makes it a little bit difficult to scale; if you have a very large application, a POSIX file system is probably not a good choice for storing large binary blob data. It's especially hard to scale to the petabyte range. If you've ever worked in the HPC space, for example with Lustre or GPFS, you know you need quite expensive hardware to scale up to petabytes. It's also about access: normally you mount a file system or attach a block device to your storage server, and it's not accessible, or at least not easily accessible, from the outside. So if a user with a phone wants to access a video or a photo, the request needs to go through a server that retrieves the data from the disks.
But if you think about this, let's say you have a video and photo sharing website: does it really require a POSIX file system to store all your video and photo content? At the end of the day, as a user I'm just interested in where the data is located and how I can retrieve it. If I get a URL, for example through an HTTP REST API, I'm not interested in which block storage device it's stored on, right? And it might be much easier if you don't need to scale out a file system or a large block storage solution. That's where object storage comes into play. It's best suited for unstructured data: a video, a photo, maybe even the backup of your database. Traditionally, object storage systems, the most well-known probably being Amazon S3, are about multi-tenancy, so you have many users, each with their own working space where their data is stored, and an architecture that basically tries to share nothing. If part of your cluster fails, you're still able to retrieve data and store new data. We're aiming for highly available, scalable, and durable systems; in many cases this is achieved by storing multiple copies, or replicas, of your data. It's also about running on commodity hardware: you don't want to run this on expensive fibre channel systems, you want to run it on cheap hardware in a reliable way. So where is it used, and how? As I said, REST-based APIs make these systems very easy to access with HTTP calls, and that opens them up to a wide range of applications: storing large sets of video files, photos, or any other blob data you have. But it's also quite useful for storing backups reliably. If you're a large company and you have a requirement to store backups off-site, for example, then deploying an object storage solution that replicates data across multiple data centers within your organization might be a good way to make it possible to restore data even if one of your data centers goes down. Another use case is archiving large sets of unstructured data, for example analytics data, or in the scientific sector climate data and medical data: data that needs to be stored for a long time but is not likely to be accessed very often. That's a very good use case for object storage. Swift itself is deployed at many large enterprise companies, and many of these deployments operate at a scale of multiple petabytes. The biggest one is actually still Rackspace, the original founders of the project, who are running a more than 100 petabyte system at the moment. I think the second biggest is OVH, a French hosting provider, and all of these numbers are public: if you go to the OpenStack Summit website and watch a few of the videos from recent talks, people were introducing their clusters and telling you a little bit about their size. So, okay, people are storing a lot of data inside Swift. How do they do that, or how would you do that? Because there's no POSIX file system and no central database or any central service within OpenStack Swift, we're using a very flat namespace. You only have accounts, containers, and object names; there is no support for nested directories, for example. So if you go to your account in an OpenStack Swift cluster, you basically get a collection of containers. If you want to view the list of containers, you just send a GET request to a specific URL and Swift returns the list to you. If you want to create a new container, you just send a PUT request and Swift will set a new container up for you. Containers themselves are collections of objects, so it's the same mechanism again: a GET request gives you a listing of the objects in that container, a PUT request uploads an object, and a GET request retrieves that object back later on.
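To make that concrete, here's a minimal sketch of those calls using the python-swiftclient library. The auth endpoint, credentials, and container and object names are placeholders, assuming a local test cluster with the built-in tempauth:

```python
# Minimal sketch with python-swiftclient; endpoint and credentials are
# placeholders for a local test cluster using tempauth.
from swiftclient.client import Connection

conn = Connection(
    authurl='http://127.0.0.1:8080/auth/v1.0',  # hypothetical auth endpoint
    user='test:tester',
    key='testing',
)

conn.put_container('photos')                          # PUT creates a container
conn.put_object('photos', 'cat.jpg',
                contents=open('cat.jpg', 'rb'))       # PUT stores an object
headers, listing = conn.get_container('photos')       # GET lists the objects
for obj in listing:
    print(obj['name'], obj['bytes'])
headers, body = conn.get_object('photos', 'cat.jpg')  # GET retrieves it back
```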
So what actually happens when you store data within Swift? When you send a PUT request, your request normally first gets accepted by a load balancer that runs in front of Swift. In this diagram only a single proxy is shown. The proxy is the node that every user talks to; it's the gateway into your cluster. Normally you have a bunch of proxy servers, tens or maybe even hundreds if you have a very large cluster, and your load balancer forwards your request to one of them. Let's assume you have set up your cluster to store data in a replicated way, in this case with three replicas. The Swift proxy server will send your object three times, to three different backend storage servers, and each of these storage servers will only reply to the proxy with a 200 OK code if it stored the data on disk and everything was fine. The Swift proxy server requires a quorum of requests to succeed: only if at least two of your three servers responded that the data was stored on disk will the user get a 200 OK back, stating that the request succeeded and the object was stored fine. There are a few more servers running on the storage nodes. The object server takes most of the load, but you also have container servers and account servers, which store the listings of objects and containers for you. Typically the container and account servers store their data on SSDs, because it's only a fraction of the total data, while the vast majority of the data is stored, hopefully on cheap disks, by your object servers. Okay, so we stored the object three times. How does Swift ensure that the data is stored durably and that you're able to retrieve the same data ten years in the future? The first check is done by the proxy itself. Whenever you store a new object within Swift, Swift will compute a checksum and store it along with your object. When it retrieves the data back and tries to send it out to you, the checksum is computed again and compared to what was stored on disk. If there's some bit rot, for example, on one of your disks, the proxy server will immediately retrieve one of the other replicas and try to send you an object that is still valid. But these checksums are also used by some background processes; you have a bunch of continuously running background processes in your cluster. First, you have the auditors, which are crunching over your objects all the time, recomputing checksums and comparing them to what was stored before.
And in case there's a failure, some bit rot for example, the auditor will move this replica to a quarantine section, and at that point you're missing one replica. Because you're now missing one copy, you need to somehow replace it, right? Swift does that on its own, using the replicator process. The replicator is checking all the time that there are, for example, three copies still left in your cluster. There might be copies missing because the auditor quarantined an object, but another case might be a broken disk, a missing server, or you upgraded your cluster and added 20 new nodes, and the new nodes don't know about the objects yet. You can also store data not only replicated multiple times within your cluster, but using erasure coding. With erasure coding, each object that is sent to Swift is split into erasure-coded fragments that are stored across your cluster, and you have a reconstructor process that recomputes missing erasure-coded fragments for you. So that's running in the background. Swift is typically run on large clusters. The smallest cluster described in the upstream documentation is about five nodes, but typically we're talking about dozens or even hundreds of nodes within a Swift cluster. And these hundreds of nodes might run across multiple data centers, or multiple floors in a data center; they might be connected to multiple power outlets, or whatever. So Swift makes it possible to group devices into failure domains using various mechanisms. First and foremost, it tries to store data on different disks: it will never store two replicas on the same disk, and it will actually complain if there are fewer disks in your cluster than the number of replicas you configured. Okay, that means if I have, for example, six servers here, each with a bunch of disks, Swift will store the copies on different disks. Now let's assume I have two floors in my data center, or two power outlets, and I want to ensure that I can retrieve data back even if one of the floors becomes unavailable for whatever reason, say a power outlet breaks. In that case, I can group devices into zones, and Swift will then actually ensure that at least one replica is stored in a different zone. If you have more zones than replicas, it will distribute them across the zones. So that's great, but it gets even better, in my opinion. We also have a concept called regions that you can put on top of that. If you have multiple data centers, let's say a big data center here in Berlin with two floors and a smaller data center in Hamburg, and you want to ensure that you keep one copy of your data in Hamburg and two copies, one on each floor, in Berlin, then you can also group these devices into regions, and Swift will ensure that each region has at least one copy and each zone has a copy. By doing this, you can easily build clusters that span multiple data centers, multiple regions across the world, in different cities or even different countries. All right, so how does Swift know where to store your data? Someone needs to tell Swift how my cluster is designed, where I expect data to go, and things like that. Swift uses a concept of consistent hashing rings. Every object that is stored in Swift, or rather its object name, is used to compute a hash value, and the complete space of possible hash values is divided into partitions. Typically you run a few tens of thousands of partitions in your cluster, and each disk gets, for example, 100 or 200 partitions out of that total key space.
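As a rough sketch of that mapping: the partition is derived from the top bits of an MD5 hash of the object's full path. The real implementation also mixes per-cluster hash prefix and suffix salts from swift.conf into the path, which are omitted here for clarity:

```python
# Simplified sketch of Swift's object-to-partition mapping; real Swift
# also salts the path with per-cluster prefix/suffix values.
import struct
from hashlib import md5

PART_POWER = 10               # 2**10 = 1024 partitions, a toy value
PART_SHIFT = 32 - PART_POWER  # the ring keeps the top PART_POWER bits

def object_to_partition(account, container, obj):
    path = ('/%s/%s/%s' % (account, container, obj)).encode('utf-8')
    # Take the top bits of the MD5 digest as the partition number.
    return struct.unpack_from('>I', md5(path).digest())[0] >> PART_SHIFT

print(object_to_partition('AUTH_test', 'photos', 'cat.jpg'))
```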
So the ring that is shown here is a very simple one; it has only eight partitions. Let's say this is partition number one, and each color maps to a different disk in your cluster. You can see that no partition uses the same disk twice, and the partitions get distributed fairly equally across the cluster. A real ring would use many more partitions than this, but it makes the idea a little clearer. The rings themselves are never managed by Swift itself; you have to do that as an operator. Back in the history of Swift, there were some ideas that Swift could automatically manage these rings: for example, if you have a broken server or a broken disk, these disks would get removed from the ring and Swift would automatically rebalance onto the remaining parts of your cluster. But that's actually a little bit difficult, because the cluster doesn't know why the disk or the storage is missing. Is it just a temporary issue? Will the storage be up again in five minutes, or tomorrow, or is it totally lost? So you as an operator have to maintain the rings, and the ring is then copied to each node within your cluster. All right, so the ring defines, for example, how many replicas you have in your cluster. But not all of your data might be ideal for, let's say, three replicas. It's quite expensive to store three replicas of data you normally never expect to read back. For example, if you're storing backups, of course you want to ensure they're still there in five or ten years, but hopefully you never have to use them again, or maybe only once, right? So maybe you have some data that is better suited for a cold-storage kind of approach, erasure coding for example. At the same time you might have other data that you access very frequently, or data that you can easily recompute; in that case you don't need three replicas, maybe you're fine with one or two, and if you lose them, you just recompute the data. For this, Swift has a concept of storage policies, which basically means that for object storage you can use different rings. These rings are defined by the operator, and the user selects, when creating a new container, which ring, which storage policy, to use. So you might have one container in your account using erasure coding, which is cheaper to operate, and a different container with three replicas, more expensive but a little bit faster to access.
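Selecting a policy is just a header on container creation. A small sketch, reusing the Connection from the earlier example and assuming the operator has defined a policy named 'cold-ec' in swift.conf:

```python
# Sketch: pick a storage policy when creating a container; 'cold-ec'
# is a hypothetical policy name the operator defined in swift.conf.
conn.put_container('backups', headers={'X-Storage-Policy': 'cold-ec'})

headers = conn.head_container('backups')
print(headers.get('x-storage-policy'))  # confirms which policy is in use
```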
Now, talking about object storage solutions: in many cases you're affected by eventual consistency, or rather, many object storage systems are affected by eventual consistency. What does that mean? There's the so-called CAP theorem about consistency, availability, and partition tolerance, and it applies when your cluster is partitioned. If, for example, you have two data centers and one data center goes down, so part of your cluster is no longer available, then you need to make a trade-off: you can basically choose two out of these three properties, and Swift chooses availability and partition tolerance, so that even if part of the cluster is not accessible, you can still retrieve data. But this comes with a drawback. Let's say you stored three replicas while your cluster was fine and all three nodes storing this data were up; now two of these nodes go down and you overwrite the object. It would be stored on the same set of servers, but two of those servers are down, so the proxy selects so-called handoff nodes, and these handoff nodes keep the additional copies. You have one of the original nodes up, two handoff nodes up, and the object will be accepted and stored for you. Now the two nodes that were down come up again. It takes a little bit of time until the replicators running in your cluster fix this for you, and in the meantime it might actually happen that you get an old copy back when you retrieve the same object. That's where eventual consistency can affect you. If you want to avoid this, the easiest way is to use different object names, maybe even adding a timestamp to them, but in many cases it's no problem at all because you're writing once and reading many times. The same applies to container listings: if your container server is overloaded or unavailable and you put a new object into your container, it might take a little bit of time until the container listing shows the new object, if the main container server was not available at the time you stored it. All right, let's talk a little bit about the Swift proxy server, because the proxy server is the gateway to your cluster and what your users are talking to. If you have a look at one of the configuration files for your Swift proxy server, you will see a pipeline. Everything within Swift is written in Python, and we're using WSGI pipelines and WSGI middlewares, so you will see a bunch of middlewares that are shipped with Swift by default that you can make use of. Let's have a look at a few of them; you can probably already guess from their names what they're doing.
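For reference, the pipeline line in proxy-server.conf looks roughly like this; the exact set and order of middlewares varies by release and deployment, so treat this as an illustrative excerpt:

```ini
[pipeline:main]
pipeline = catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit authtoken keystoneauth container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server
```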
First we have the container sync middleware. This is quite useful when you want to exchange containers between clusters; we're talking about different clusters here, so it might even be a cluster from a different company, or from a different part of your organization. You have a cluster on site A and a cluster on site B, both totally independent, but you want to share some data between these two clusters, for example within a single container. In that case you can use container sync. If I remember correctly, there are some public cloud vendors who actually support this, so if you have a small cluster in your organization, you could sign a contract with them and have them store an extra copy for you besides the copies you have in your own cluster. We have a middleware for bulk operations: for example, if you have thousands of small objects that you want to upload fast, it's probably much easier to just send a tar file to your cluster and tell it, okay, extract these into my container, and the bulk middleware handles this for you. Next are some authentication middlewares, starting with authtoken and keystoneauth. Keystone is another OpenStack project, used basically in every OpenStack deployment, and of course Swift can talk to Keystone and use it for authentication. But you can also write your own middleware if you need to hook Swift up to whatever else you already have available in your company, and there are example authentication middlewares shipped along with Swift. As the name implies, a token is sent with each request to authenticate against Swift, but that's not feasible in every case. Let's assume you want to upload or download data with your web browser directly from Swift. You can work around this, of course, and send tokens, for example if you're using Angular, but it might be easier to use signed URLs, and the tempurl middleware that you see there handles exactly this: you can compute pre-signed URLs that are valid only for one object and only for a specific time, and use them to download or upload data. We have middlewares for copying data within your cluster from object name A to B. There's support for quotas, both on the container level and the account level. SLO and DLO are both middlewares for large objects: many object storage solutions have a limit on the per-object size, and in Swift's case it's five gigabytes, which matches Amazon S3 very well. If you want to store larger sets of data, for example a 20-terabyte video file, you chunk the large file while uploading, split it up into smaller objects, upload those to your container, and finally send a manifest object to the Swift cluster; when a user retrieves the object back, they just see this one 20-terabyte object that gets sent back to them. Versioned writes is a middleware for having a versioning system in place: when you overwrite an existing object, the older copies are still kept for you, so you can access them in the future as well. And one of the latest features that got added to Swift, which will be part of the next release in a few weeks, is the encryption middleware. That makes it very easy to encrypt all the data that is stored on your disks and your storage servers: every object that gets sent to Swift is encrypted on the proxy, and the storage servers only store encrypted data on disk. There are no more plaintext objects stored on the disks. This was actually a very strong requirement for some enterprises; it's especially useful if you're replacing disks and need to send them back to your vendor. Maybe a disk is broken and you can't overwrite it first, so this way you can ensure that at least no plaintext data leaves your company. If that's not enough, it's very easy to write your own proxy middlewares. People have done that in the past for various reasons; one of the most popular is the swift3 middleware, which emulates the Amazon S3 protocol, so if you have an application that is Amazon S3 compatible, you can use it with Swift together with this middleware. People have also written other middlewares that are not part of the upstream OpenStack Swift project itself but are still available, for example to integrate Elasticsearch, to send out notifications to their applications when there are new objects, or for data processing, for example to downscale images on the fly while uploading them to your Swift cluster. If you're interested in that, I gave a talk in the past about how to develop your own middleware, and that one might get you started.
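To give you an idea of the shape of such a middleware, here's a minimal, illustrative sketch following the standard WSGI filter pattern Swift uses; the class name and its logging behavior are made up for the example:

```python
# Minimal sketch of a custom Swift proxy middleware (illustrative only).
class RequestLoggerMiddleware(object):
    def __init__(self, app, conf):
        self.app = app    # the next WSGI app in the proxy pipeline
        self.conf = conf

    def __call__(self, env, start_response):
        # Inspect or modify the request here, then pass it down the pipeline.
        print('Handling %s %s' % (env['REQUEST_METHOD'], env['PATH_INFO']))
        return self.app(env, start_response)

def filter_factory(global_conf, **local_conf):
    # Swift calls this factory for each middleware listed in the pipeline.
    conf = dict(global_conf, **local_conf)

    def factory(app):
        return RequestLoggerMiddleware(app, conf)
    return factory
```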
Actually, you can not only write middlewares for the proxy server, you can do the same for the account, container, and object servers. IBM, for example, is using a similar approach in a project called Storlets. The basic, simplified idea is that instead of moving data to some computation and doing the computation there, they use the spare resources on the storage nodes to do computation on the data. So you get some computation possibilities on the Swift cluster itself, especially if the cluster still has compute resources available. All right, let's talk a little bit about using Swift. There's an easy, scaled-down setup to run Swift on a single node. Don't use it in production, that's a bad idea, but it's very good for getting started, whether you want to do upstream development on Swift itself or start writing applications for Swift; in this case you can run everything on a single machine. It's described very well in the OpenStack documentation, but there's also an Ansible-based setup that I've made available, using Vagrant on Fedora 24, to bring up a small single-node environment within a few minutes so you can try out Swift and see if it's something for you. You'll find it at that link; the slides are hopefully already available, and if not, they will be later. Swift comes with a command line interface, called swift, surprisingly, and these are a few of the most-used commands. stat gives you some statistics, in this case about the account. If you post a container, it creates a new container for you. If you want to upload some data to your container, you just give it the container name and the file name. Then you can list your container, you can run stat on the object itself, which returns, for example, the full URL inside your cluster where the object is accessible, and you can download data using the download command. If you don't want to use the swift command line interface, which is likely if you're writing your own application, then you can use the debug option. The debug option shows you various information about what's being done in the background, and it also includes some curl examples: in this case the full URL of your object, along with the header that needs to be used when you want to access it. As I said earlier, token-based authentication is sometimes not feasible, most likely when you're using a web browser or web appliance, and in that case you can use signed URLs. The swift command line interface has some built-in support for computing signed URLs, and the basic idea is that with the first command, the post command, you store some metadata, a secret key, inside your Swift account.
This is a metadata key with a value, not a very sophisticated value, but anyway. The next step is to compute a signature, for this object for example. What you're doing here is saying which method you want to allow for the signature, so it's only valid for a GET request, only valid for one hour from now, and only for this object. And what it returns is something like this: you can see a signature at the end, and a Unix epoch timestamp for when this request will expire. It's only valid for this one request. So that's an easy way to get data out of your cluster. You can use a very similar approach if you want to do form-based uploads. In that case you also compute a signature, and as you can see here, there's one of these hidden form fields containing your signature, which is valid only for this name and only with these parameters. So even if you, or your user, try to change these parameters in the form later on, it becomes invalid and the proxy will deny the request. So that's that. Let's have a look at this small VM that's running here. The first request is a swift list. I have a single container in my Swift cluster in this case. I can show you some statistics about the overall usage of my account: as you can see, I have a single container, and I haven't stored that many bytes, zero bytes actually. It's a pretty plain, empty account at the moment. So let's create some object, my hello world object, and do an upload to my container. All right, my statistics should have changed now. Not yet, okay; that's the eventual consistency kicking in right now. But what we can do is list my new container, which shows my object, and of course I can download it again using this request. If you're interested in the backend calls, let's do the same with the debug option. What you can see here is, let's see, the first thing is a container listing: I have my URL, this is my proxy server listening here, I have my account name and then a container name, and it asks for the data back in JSON format. This is the response from the proxy server; it includes, where is it, yeah, some statistics. And the last request, this one, is the download of the object. It includes the object name, it's a GET request, and it has an authentication token. So if you just copy this curl command, like this, you'll see that you get your object back, right? And if you remove the authentication token, well, you'd already expect this: it's an unauthorized request, because the authentication token is missing. Now in the backend, if you look at the storage nodes themselves, in this case they all store data under a single mount point, you'll see something like this. There's only a single object stored in my cluster right now. Each of the objects has a .data suffix, here you can see the precomputed hash value of the object name, and this number here is the partition number. It gets distributed across multiple disks, of course; I have three or four disks in my cluster at the moment, and it was placed here. If we look at one of these files, I would expect my hello world content here. Oh, there it is. Okay, so now on to the tokenless requests.
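Under the hood, the signature for such a tokenless request is an HMAC-SHA1 over the method, expiry time, and object path, keyed with the secret stored on the account; python-swiftclient also ships a generate_temp_url helper that does the same. A sketch, with the key and object path as placeholders:

```python
# Sketch of the signature the tempurl middleware verifies; the key and
# path below are placeholders for your own values.
import hmac
import time
from hashlib import sha1

key = b'mysecretkey'                              # X-Account-Meta-Temp-URL-Key
path = '/v1/AUTH_test/container/hello_world.txt'  # full object path
expires = int(time.time()) + 3600                 # valid for one hour

message = ('GET\n%d\n%s' % (expires, path)).encode('utf-8')
sig = hmac.new(key, message, sha1).hexdigest()

print('%s?temp_url_sig=%s&temp_url_expires=%d' % (path, sig, expires))
```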
I wanted to give you a short example of this too. There's this command that expects various arguments: I want to issue a GET request, it should be valid for one hour, and I need the full URL path to my object. So let's get that from Swift itself using, wait a second, container, this one. That would be my signed path, my signed URL, and if I go to my web browser now, I can use it. I also have a small example for you that uses the same approach with the formpost middleware. So let's upload some data, world.txt. Before I do that, I delete my existing data, to really show that the new upload gets accepted by the Swift cluster. New container, all right, it should be, yeah, it should be gone. So let's upload this one. What happens in the background is, you have, can you read that? Yes, you have this very basic HTML form, right? You have the URL, basically a few of the options I showed you before, and finally a redirect after the upload: when the upload has been done successfully, the Swift proxy server will redirect the browser to some other URL, in this case a download URL running locally. And if I do that, this is my download URL; you might see that there's a 201 status in this URL, which was added by the Swift proxy server. And if I click on this one, let me show you the original URL, this is my precomputed URL, and in this case you can actually open it and do whatever you want with it. All right. So, a few things I'd like to add, especially if you start developing applications for OpenStack Swift, things you should or shouldn't do. First: Swift is really built to scale across multiple accounts and multiple containers. It came out of the requirements of a public cloud provider, and a public cloud provider typically has hundreds of thousands, if not millions, of accounts, each storing some data. So it's a bad idea, at least for the moment, to store everything in a single account and a single container. If you do that and you reach a size of, let's say, 10 million objects in a single container, then it's a very good idea to, first, think about whether your application could take a different approach, and second, to use SSDs to store your container databases on. That sounds more expensive than it is: the container databases are only a fraction of your data, so it might be a good option. Upstream we're working very hard to make container sharding possible, and that will hopefully come in the next release and solve this issue, but for now you should keep it in mind. As I told you earlier, we're hashing the object names in Swift to compute the location where the data is stored. So don't mimic a rename using a copy and a delete. You can do that, of course, for a few objects, but we've seen people try to mount a POSIX-file-system-like gateway on top of Swift, using just containers and object names, where a rename actually sends a copy and a delete to Swift. That's fine if you're doing it on a small one-kilobyte object within your tests; if you open that up to the public and have tens of thousands of requests per second, it won't scale. So copying and deleting to rename data in Swift is a bad idea. It's a better idea to keep such naming information in a separate database, or in the object's metadata, because just updating a metadata field is a much cheaper operation.
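For example, if what you actually want to change is a user-visible name or attribute, a cheap alternative is a metadata update. A sketch, reusing the Connection from before; the 'Display-Name' key is made up for the example, and note that a POST on an object replaces its user metadata as a whole:

```python
# Sketch: update object metadata with a single POST instead of copying
# and deleting the object; 'Display-Name' is a hypothetical metadata key.
conn.post_object('photos', 'cat.jpg',
                 headers={'X-Object-Meta-Display-Name': 'kitten.jpg'})

headers = conn.head_object('photos', 'cat.jpg')
print(headers.get('x-object-meta-display-name'))
```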
And if you're writing web applications, keep in mind that the container listing in particular might not be updated immediately after an upload; it might be updated after two seconds or so, but you shouldn't rely on that, at least not if you need it immediately. All right, I think I'm done with my presentation, so thank you. If you have any questions, please let me know. Okay, nothing unclear then. All right, thank you very much for attending my talk. If you want to reach out to me, just write me a mail; I'm available at the Red Hat booth in the afternoon, or in #openstack-swift on IRC. Hopefully see you there. Thank you.