Since we won't be able to go around with the microphone, please repeat the questions when there are open questions. We also have a few scarves to give out to people who ask questions. Okay, welcome, and thank you for coming. My name is Thiago da Silva and this is Prashanth. We both work at Red Hat. I'm a software engineer there; Prashanth and I both work primarily on Swift, and I'm a core developer in the community. We're going to be talking about what we have learned about Ceph, Gluster, and Swift. The reason we wanted to talk about this is because we find it pretty cool. I consider myself pretty lucky, because even though I work on Swift, I sit right next to architects and developers for both Gluster and Ceph, so we share a lot of knowledge. We talk a lot about the principles of distributed storage systems, and we find it very interesting to see the problems each one has: they share a lot of the same principles and the same issues, and sometimes they come up with different solutions, while a lot of times they come up with the same solutions for similar problems.

Talking a little bit about these systems: of course, they are all open source. In a way they are all trying to achieve the same thing, which is scale-out storage. These are systems that are trying to allow the flexibility of growing your storage system: as your capacity and your needs grow, they can grow. They are also trying to prevent vendor lock-in and to run on commodity hardware. The interesting thing about these three systems is that they are all very successful projects. Each one of them is deployed massively across many different organizations. We hear about CERN running Ceph clusters. We hear about Facebook running Gluster clusters. We hear about Rackspace, of course, the original author of Swift, and the huge clusters they have there. So all of these projects are very, very successful, they each have a vibrant community, and they've been around for a while. It's very interesting to see all three projects running very well.

Talking a little bit about the storage types that they support: Ceph supports block and object today, and they are working very hard on CephFS, which will bring support for file on Ceph. Gluster supports primarily file; that's what it was designed for at first, but there are also community members out there who use it for block and object storage. Swift, on the other hand, was primarily designed for object storage, and it has no plans to support either block or file. So again, that's a little bit about how similar or different they are. Prashanth is going to talk about the architecture of each one individually.
These three storage systems are very complex and vast, so we'll try to have a crash course on each of them as quickly as possible. Starting with Ceph: this is how the Ceph storage stack looks. At the bottom of the stack is RADOS, which is a reliable, autonomous, distributed object store. Before we get into the object store itself, what is an object? What does an object comprise? When I say object, what it really means is: you have an ID, you have the data in the object, and you also have metadata. The ID is used to store and retrieve the object, the data is the actual content of the object, and in this case the metadata is key-value pairs, which are attributes of the object.

On top of this stable, very cool object store, you have other layers built. For example, we have librados. If you have a custom application with a specific use case that wants to talk directly to the RADOS storage system, it can link with librados and store and retrieve objects in Ceph. You also have the RADOS Gateway, which is an HTTP server that provides a REST interface to the objects present in Ceph. Block is one of the most popular and widely used ways to consume storage: when you have VMs, Ceph can export a block device to the VM, and the VM can choose to attach it and format it with a file system, or if it has applications that can talk block directly, it can consume the block device itself. And then we have CephFS, which is a distributed POSIX-compliant file system. It's more like NFS, and it can be mounted and consumed in a VM or on a client.

One difference between applications consuming RADOS directly and via the RADOS Gateway is that librados gives you lower-level access. You can do things like store objects and partially update them, you can have compound operations on objects, and you can get notified when objects change. The RADOS Gateway, on the other hand, provides a lot of additional features: you can put objects and segregate them into buckets, you can authenticate users and provide ACLs on buckets, and you can also set quotas on buckets. So the RADOS Gateway is richer in terms of feature set. Also, when you store objects directly over librados, the object size is typically small; what the RADOS Gateway does is stripe objects, so that you can have really big objects when you store them via the gateway. And if you have existing applications written against the Amazon S3 API, you can use them directly with the RADOS Gateway; you just have to change the URL and point your application at the gateway instead. It also supports the Swift API. In Swift's terminology, a bucket is a container: if you are talking in terms of S3, you put objects into buckets; if it's the Swift API, you put objects into containers.

So let's go a little deeper into how the RADOS object store itself looks. A Ceph cluster primarily has two kinds of daemons. One is the OSDs themselves, and these are usually hundreds to thousands in number. Each one usually takes care of one disk (it can be a RAID group as well), and their primary purpose is to store objects and serve them to clients. When I say clients, that can refer to a librados application; a client can also be the RBD layer, or it can be CephFS. And OSDs are not passive devices. These are very intelligent, very cool devices, and they usually take care of replication themselves.
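As a rough example of that kind of direct access, here is a minimal sketch using the python-rados bindings. It assumes the bindings are installed, that /etc/ceph/ceph.conf points at a reachable cluster, and that a pool named "images" already exists; the pool name, object ID, and xattr key are made up for illustration.

```python
import rados

# Connect to the cluster using the local Ceph configuration (assumed path).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# An I/O context is bound to one pool; "images" is a hypothetical pool name.
ioctx = cluster.open_ioctx('images')

ioctx.write_full('photo-123', b'...jpeg bytes...')   # object ID + data
ioctx.set_xattr('photo-123', 'owner', b'alice')      # metadata as key/value
data = ioctx.read('photo-123')                       # retrieve by ID

ioctx.close()
cluster.shutdown()
```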
They also make a note of other OSDs going down and report it to the monitors. Monitors are the daemons that know how the cluster looks: they know which OSDs are down, which nodes are down, which nodes are up, which devices are in the cluster and which are not. Their main purpose is to maintain the cluster membership and state, and they use an algorithm called Paxos internally to vote and come to a consensus for decision making. Monitors are deployed in small, odd numbers, because you usually need a quorum when you have to put things to a vote, and if you have a huge number of monitors it takes longer to come to a consensus. If you deploy CephFS, there's an additional daemon that needs to be running, called the MDS, the metadata server; what it does is keep track of all the file system metadata. That's how a RADOS cluster would look.

Now let's have a brief look at how Ceph places data in the RADOS cluster. Ceph uses something called the CRUSH algorithm. Ceph can store millions and billions of objects, and there should be some way to segregate these objects, so a Ceph cluster is divided into pools. Why pools? There could be a use case where you need some set of objects replicated three times and another set replicated less. A good example: let's say you provide a system where you allow users to upload images and you need to store thumbnails. You may want to store the images with multiple copies for resiliency, but you don't really care as much about the thumbnails; if you lose them, you can always regenerate them, so you may want to store just one copy of each thumbnail. That's where you can create two pools: one pool can store three copies, another can store one.

Also, as these clusters are huge and contain so many objects, it's hard to keep track of where each object goes, so you group them into placement groups. Placement groups are collections of objects, and these placement groups are mapped to OSDs. By default, when you have three replicas, three OSD daemons are responsible for storing and retrieving those objects. Here's an example of how CRUSH determines where to store and place objects: CRUSH uses the name (the identifier) of the object itself and the pool name to derive the placement group, and CRUSH knows which OSDs are responsible for storing objects that belong to that particular placement group. There is a primary OSD that initially accepts the request from the client and stores the data, which is then replicated to two other OSDs. These writes are synchronous, so the acknowledgement is given back to the client only after all three writes have been done. That's pretty much how Ceph manages data placement.

Now let's look at how Gluster does it. Here you have a bunch of nodes, and all of them contain a bunch of disks. Gluster has this terminology called a brick: a brick is typically a disk or a mount point. What Gluster does is combine the capacities of these bricks spread across various nodes and present them to the client as a single namespace, and that single namespace is called a volume. So volumes are made up of bricks. In this example, the blue bricks form a replica set, and the brown ones also form a replica set. Just like in the Ceph stack you have an object store at the base, here you have the robust GlusterFS file system.
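As a rough illustration of that placement idea (not the real CRUSH implementation): hash the object name to a placement group within the pool, then map the placement group to a replica set of OSDs. The MD5 hash, the modulo, and the stand-in OSD selection below are simplifications made up for illustration; real Ceph uses its own hashing plus CRUSH walking the cluster map so that the chosen OSDs sit in different failure domains.

```python
import hashlib

PG_NUM = 128        # placement groups in this hypothetical pool
REPLICAS = 3        # replica count configured on the pool

def pg_for_object(name):
    # Hash the object name to pick a placement group within the pool.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % PG_NUM

def osds_for_pg(pg_id):
    # Stand-in for CRUSH: deterministically pick REPLICAS distinct OSD ids
    # out of an imaginary 30-OSD cluster. CRUSH actually walks the cluster
    # topology (hosts, racks, rows) to respect failure domains.
    return [(pg_id * 7 + i * 13) % 30 for i in range(REPLICAS)]

pg = pg_for_object('photo-123')
print(pg, osds_for_pg(pg))   # the first OSD listed acts as the primary
```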
On top of the GlusterFS file system you have various other access mechanisms, such as FUSE, NFS, SMB, and also object-based access. Just like Ceph, Gluster does not have a central metadata server: it does not have to look somewhere or talk to a server to know where the files are; it can compute that in real time. And just like Ceph, it also uses hashing to find where the data is. You have the entire hash space divided into different ranges, and these ranges are assigned to bricks, so when you store a file, Gluster knows where to place that file.

Let's look at an example. Here I have three bricks; one is a different size and two are the same size. What Gluster does is assign a hash range to each of them depending on the size of the brick: if a brick is bigger, it assigns it a larger chunk of the hash range, so that more files will likely go into that brick. Then, when you want to store or access a file, Gluster hashes the file name, and from the hash it knows which brick the file belongs in. Unlike object stores like Ceph and Swift, GlusterFS is a file system, so it allows renames, and renames are handled in a specific way, because when you change the file name, the hash changes. So when you rename a file, Gluster creates a special pointer file, without moving the actual data immediately, that points to the real location. And as and when you want to grow your cluster, you add devices, and when you add devices the existing data is moved to the new devices so that there's a uniform distribution again. Gluster has logic for the case where a file's hash now points to a new brick while the data is still on the original brick: it creates what's called a link-to file on the new brick that points to the original location. It's the same mechanism as for renames.

GlusterFS by design is very modular, in that you can stack up different modules. In the earlier example you had seen bricks; here, two bricks are combined as a replica set, and when the distribute module writes data, instead of writing to one brick, it writes to a replica set. The replication module in Gluster is responsible for providing high availability in case of failure. It follows a transaction model: either it writes to both replicas or it writes to neither. And when one of the bricks is noticed to be down and later comes back up, Gluster has the capability to copy the newer, latest file and overwrite the old, stale file on the other node. That's called self-healing. Yep, that's pretty much how Gluster does it.

Yeah, if it's just pure distribute, you do lose access to the files on a brick that goes down. But let's say the first brick of a replica pair goes down and you don't have access to it: there's a configuration in Gluster where, by default, you can still write to the other brick; the client still gets to do I/O on the other brick, and when the first brick comes back up, all the I/O that was done on the second brick is copied back to the first brick automatically. Does that answer your question? Not quite. Well, if you go on the back end, you can still access the files; they just won't be presented through the volume. Can you just add one copy? Yeah. Let's say the first brick is down now and the client cannot access it, so all the I/O happens on the second brick. While the second brick is doing I/O, it will mark each file, saying that the other brick is down and this file needs to be healed when the other brick comes back up.
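To make the hash-range placement from a moment ago a bit more concrete, here is a toy sketch. It is not Gluster's actual DHT code (which uses the Davies-Meyer hash and stores per-directory layouts in xattrs); the brick names, sizes, and MD5 hash are stand-ins for illustration.

```python
import hashlib

# Relative brick sizes (hypothetical): bigger bricks get bigger hash ranges.
bricks = {'brick-a': 2, 'brick-b': 1, 'brick-c': 1}
total = sum(bricks.values())

# Carve the 32-bit hash space into ranges proportional to brick size.
ranges, start = [], 0
for name, size in bricks.items():
    end = start + (2**32 * size // total)
    ranges.append((name, start, end))
    start = end

def brick_for(filename):
    # Hash the file name and find the brick whose range contains that hash.
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16) % 2**32
    return next(b for b, lo, hi in ranges if lo <= h < hi)

print(brick_for('report.pdf'))
```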
So in the back end, although there is no central metadata server, there are directories where the AFR module keeps track of the files that need to be healed. And that's why it's so important to point out that it uses this very deterministic hashing algorithm. So like Rick said, you have two replicas and one goes down; the hashing algorithm for a file that is currently stored will still point to both bricks. One you can't get to, the other one you can, so you still have access to your data, which I think also answers the question. Exactly. Yeah.

Swift is an OpenStack project and is purely an object store. In Swift, a cluster is logically divided into accounts, which contain containers, which in turn contain objects. Swift has four services, or daemons, running: the proxy server, the account server, the container server, and the object server. The container servers keep track of which objects are stored in a container, and the account servers keep track of which containers are stored in an account; these are stored on the back end as SQLite databases. And just like objects are replicated three times by default, even the SQLite databases are replicated three times. A client to Swift can be any HTTP client: it could be your mobile phone, it could be a web browser. The proxy server acts as the single entry point to the cluster, so clients never talk to the back-end nodes directly; all the traffic goes through the proxy, and the proxy knows how to route the request to the appropriate daemon or service in the back end. For example, all the images and thumbnails that you see on Wikipedia are stored in a Swift cluster, which is pretty cool. And yes, you can have a load balancer and multiple proxies.

Just like Ceph has a CRUSH map to know the layout of the cluster, there's something called the ring in Swift. These are internal data structures: what an admin of the cluster does is create these ring files and push them out to every node, so that every node has an idea of what the cluster looks like. Whenever you add devices or nodes, you need to recreate these ring files and push them out to all nodes. Swift also uses hashing without any central metadata server, no surprise there. Swift divides the hash range into different partitions, and it precomputes these tables (the partition-to-device mapping and the device table) when you create the ring file. So when there's a request to create or retrieve an object, Swift does an MD5 hash of that object's name and determines the partition; each partition in this example is assigned to three devices, and with the table of devices Swift knows exactly which object goes to which node. When it creates these data structures and tables, Swift has a replica dispersion algorithm that makes sure the replicas are in different failure domains. For example, if you have just one machine in your Swift cluster and three hard drives, it will make sure the three replicas are on three different hard drives; if you have a huge cluster, it will make sure, let's say, that the three replicas are in different racks. So Swift is very intelligent when it builds these ring files, making sure the replicas are in different failure domains. That's about the architecture of these systems. Thiago will explain further similarities and differences. Thank you. So hopefully you are already able to see some of the similarities that they share.
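Before moving on, here is a toy sketch of the ring lookup just described: MD5 the object path, take the top bits as the partition, and look the partition up in a precomputed partition-to-devices table. The part power, the tiny device list, and the naive round-robin assignment are made up for illustration; the real ring builder also mixes a cluster-wide hash suffix into the MD5 and applies the dispersion logic mentioned above so replicas land in different failure domains.

```python
import hashlib

PART_POWER = 8   # 2**8 = 256 partitions (example value, not a recommendation)

# partition -> three device ids, as the ring builder would have precomputed;
# a simple round-robin here, with no failure-domain awareness.
part2dev = {p: [(p + i) % 6 for i in range(3)] for p in range(2**PART_POWER)}
devices = [{'id': i, 'ip': f'10.0.0.{i + 1}', 'device': f'sdb{i}'}
           for i in range(6)]

def devices_for(account, container, obj):
    path = f'/{account}/{container}/{obj}'.encode()
    # Top PART_POWER bits of the MD5 select the partition.
    part = int(hashlib.md5(path).hexdigest(), 16) >> (128 - PART_POWER)
    return [devices[d] for d in part2dev[part]]

for dev in devices_for('AUTH_test', 'photos', 'cat.jpg'):
    print(dev['ip'], dev['device'])
```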
So for example, again, they are all scale-out systems; they allow you to grow as your capacity grows. They are all using some kind of hashing algorithm to determine where your data is located and on what disk. They are all trying to save your data in such a way that if you have a failure, you don't lose your data. For that, up until now, they have all primarily been using replication; more recently, all of these systems are also introducing erasure coding.

Looking a little deeper into their similarities: in Ceph, the Ceph client speaks directly to the OSDs, as Prashanth mentioned, and you typically have a mapping of an OSD service to a disk, to a low-level disk. The OSD disk is usually formatted with an XFS file system, because Ceph uses xattrs (extended attributes) on the file system to store metadata. The same is true for Gluster and Swift: they are all using the XFS file system at the very low level to store the data on disk. In the case of Ceph and Gluster, the client speaks directly to those underlying services, the Ceph OSD and, in the case of Gluster, glusterfsd, while in the case of Swift you have the proxy communicating with the Swift object server, which in turn also uses the XFS file system to store the data.

Looking a little more at how each system does redundancy and rebalancing: in Ceph you determine your replication, your redundancy level, at the pool granularity. In a cluster you can define several pools, and for each pool you decide, okay, I want this pool to have three replicas, I want this other pool to have two replicas, depending on your use case. In Gluster, you determine your redundancy at the volume level, and in Swift you determine it at the container level: you define what's called a storage policy, and when you create your container, you select which storage policy you want. In terms of where the data is placed across different failure domains, in Ceph it's managed by the CRUSH algorithm, and in Swift it's managed by the ring, like Prashanth talked about. So what Ceph and Swift allow you to do is define your cluster topology: as you are first setting up your cluster, you can describe what your topology looks like, and both Ceph and Swift will use their algorithms to place the data in different failure domains so that you don't lose your data. Gluster is different; it's more of a manual effort by the admin. As you're defining your volume, you have to specify which disks to use for that volume, and the admin himself has to know that a disk is in this rack or that rack so that the replicas end up in different failure domains. For that, Luis is going to save us: he's actually working on a project called Heketi that will automate that process in Gluster, so that it automatically creates volumes using disks in different failure domains and it's not such a manual process. He's going to be talking about that as well. So there you go, I did your ad for you.

In terms of both re-replicating data that has failed, as you were asking before, and rebalancing data as you add more nodes: in Ceph, it will rebalance a whole placement group. So instead of rebalancing or moving one file at a time, trying to figure out which file to move, it rebalances a whole placement group.
In Gluster, it actually rebalances or replicates individual files, and in Swift, it rebalances or replicates partitions. So again, remember that placement groups and partitions are sort of the same idea: they are ranges of your hashing algorithm.

Just to understand a little more about how each of the systems does replication: when an RBD or librados client sends data to a Ceph cluster, it communicates with one primary OSD. It sends the data to that primary OSD, and that primary OSD is responsible for sending the data to its secondary and tertiary OSDs; those three OSDs, again, are in the same placement group. Once it gets the OK from those OSDs, it returns to the client saying, I have saved your data durably. Gluster does more of a fan-out from the client itself: the client is responsible for speaking to the three different bricks (I'm assuming here that it's a three-replica system; hopefully that is clear). Gluster will communicate with the three different bricks and send, or fan out, the data from the client directly, so the client here is responsible for doing the replication. In the Swift case, you have the HTTP client sending the request to the proxy to put the data, for example, and the proxy is then responsible for fanning out that data to three different object servers. The takeaway here is that all of these systems are responsible for storing your data in a very durable way: they will not respond to the client until there is some kind of quorum that your data has been saved. For example, in the case of three replicas, at least two of the replicas must have acknowledged the write back to the client, or to the proxy in the case of Swift, before it reports that the data has been saved.

Just to show a little more of the similarities, and to finish this thought: the Ceph RGW case is very similar to Swift, where you have a RADOS Gateway and your HTTP client communicates with the RADOS Gateway, and the RADOS Gateway is responsible for storing the data. In the case of Gluster, for example, if you didn't use the FUSE client and you used NFS, you would have an NFS server; your NFS client communicates with the NFS server, and the NFS server is then responsible for doing the replication for you. Yes. Thank you. Does that answer it?

We just wanted to show really quickly how your data looks in the underlying file system. In the case of Ceph, Ceph provides some tools for you to communicate directly with RADOS, put your data using the RADOS tools, and then look up where your data is in the cluster. You can see that Ceph, like Prashanth talked about, actually breaks it up into small objects. It's not very human readable, so you'd have to go searching for it; basically, you'd have to use some kind of tool to go look for it. In the case of Gluster, the directory structure maps directly to how the data is stored on the brick. And in the case of Swift, Swift actually takes your URL, takes the end of that key, hashes it, and stores the file under that hash name, so it's not human readable at all; it's very difficult to find that data on the underlying file system. All three systems use xattrs to store a whole bunch of data, so yes, it stores the file name, for example, there. In the case of Swift, and actually also in the case of object storage in Ceph, a container database is also used to store some metadata about the object being stored.
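If you want to poke at that xattr metadata yourself on a storage node, a small sketch along these lines works. The path here is hypothetical, and the exact xattr names (for example user.swift.metadata on Swift object files, or trusted.gfid on Gluster bricks) vary by system and version; trusted.* attributes typically require root to read.

```python
import os

# Hypothetical path to a file as stored on a Gluster brick; for Swift it would
# be the hashed .data file under /srv/node/<device>/objects/... instead.
path = '/bricks/brick1/photos/cat.jpg'

# List and dump the extended attributes the storage system left on the file.
for name in os.listxattr(path):
    try:
        value = os.getxattr(path, name)
    except OSError:
        value = b'<unreadable>'
    print(name, value[:60])
```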
In terms of feature parity, again just very quickly: they all support some kind of quota. Ceph supports quota at the pool level, Gluster at the volume level (plus directory quotas and inode counts), and Swift at the account and container level. Both Ceph and Gluster support tiering; those are fairly new features for them. Swift does not have a tiering feature yet. In terms of geo-replication, being able to replicate to far-away data centers with higher latency: Ceph and Gluster have active-passive, or master-slave, geo-replication, and Swift provides active-active replication, where you can write and read your data from both data centers. Like we talked about before, they all support erasure coding and also bit-rot detection: as data sits on disk for a long time, you want to be able to detect bit rot. That's all we've got. We had a bunch of questions already. Any more questions? We've got about five minutes or less.

Yes. So you're asking about how replicating data on a failure affects the network? Yeah. Oh, boy. I know, for example, that Swift allows you to use a different network just for replication, exactly to limit that. I'm not 100% sure whether Gluster supports that or not, so there it would happen on the same network as the data path. Yeah. Absolutely. You get a scarf, by the way.

Dispersion rules in Gluster? Right now, I know they have AFR, and I know they're talking about using NSR, which has been talked about for a while, and also striping, too. Almost. I know Gluster supports InfiniBand, and Swift... I'm looking at Ilya, no? It's a work in progress. Sorry that I'm not as much of an expert on Gluster as on Swift, as you can tell, but I know Gluster supports InfiniBand, and for Swift it's a work in progress.

I think we're out of time, I've just been told, I'm sorry. But thank you very much. If you have any further questions, you can catch the guys outside.