from the core Ceph team to talk a little bit about our object storage, the RADOS Gateway in Ceph. Hello. Thank you for coming so early on Sunday, I know it's hard. I'm Orit, I'm part of the Ceph team at Red Hat, and I work on cloud object storage in Ceph. Today we're going to explain what cloud object storage is, how it's different from the other storage types, and a little bit about Ceph. So first, who here uses Ceph? Not many; we need more hands. We'll explain what Ceph is and how it's built, and then we'll talk about the RADOS Gateway, which is the part of Ceph that provides the cloud object storage interface. And of course we'll have time for a few questions. In general I prefer people to ask questions during the presentation, so if someone has a question, just raise your hand. Let's start. What is cloud object storage? First, we have block storage. Who uses block storage? Yeah, of course, we all do. The storage is divided into fixed-size blocks, depending on the device. We don't keep any metadata with the data; we just write the data somewhere on the device at some offset and length, and the application needs to track what exactly is written and where. You can have several devices. But it's really fast. We have several protocols: Fibre Channel, SCSI, SATA, iSCSI, and some people even run Fibre Channel over the internet. But using block storage directly is really hard for applications. So then there are file systems. Here the data is organized: we have a hierarchy of directories that contain files and other directories. We have users, we have metadata. The file data can be anywhere: you can overwrite in the middle, allocate on demand. It's much easier for the application to use the storage, but it's slower than block. There are lots of arguments, and some file-system people disagree, but it's slower, because of all the metadata handling and the extra layer you're going through.
Not all file systems, but many, have sharing semantics. You can lock a file, open it read-only or exclusively, and two different applications can read and write the same file at the same time. We have local file systems like ext4, XFS, and so on, and we also have network file protocols like NFS and SMB. There was also AFP, the Apple protocol, but Apple decided not to use it anymore and is moving to SMB. But then came something new: the cloud. We actually had the problem before the cloud came; we saw the data growing and growing, so people moved to more distributed storage systems. And writing a distributed file system is really hard. I don't know how many of you have tried writing a regular file system, but when you try to distribute it, the metadata needs to be synchronized all around, and that's a problem. In the cloud it became even harder, because we're not only talking about a large amount of data accessed from everywhere, we're also talking about really big chunks of data. Object storage existed before cloud object storage, but it was mostly a niche; there was research on object storage, but it wasn't used as a standard, and nobody really used it. Then came Amazon. When they launched their own object storage, it caught on. I remember, something like ten years ago when S3 was launched, I thought: wow, this is the way to do storage. As a storage person I thought it was amazing. I was working at IBM Research at the time and said, we need to do something like that at IBM. But of course IBM is not good at being innovative, and they never did anything like it. The idea is very simple. We're talking about a cloud environment, so you want the API to be built for the cloud. That means it should be HTTP-based, a REST API.
You want some of the organization of the data, but the complexity of a file system hierarchy is really hard, so it's just two levels. Amazon calls them buckets; Swift calls them containers. A bucket is just a container of objects, a way to organize your objects into groups and maybe attach some properties to them. So that's the bucket, and inside the bucket you have the objects. Objects are not just data: besides the data you can attach all sorts of metadata, some fixed and some you define on your own, so you can find objects and know what data is in them. You have users and tenants, where a tenant is like an umbrella for users to share the same data between them. You have authentication, ownership, access control lists. And we're talking about large objects, really large, which means you need lots of storage. Because objects are large, they are usually immutable: you write them once, and if you decide to change one, you have to overwrite the whole object; you cannot write in the middle. For large objects immutability is much more efficient, and it simplifies allocation. The most common protocol today is Amazon S3; it's the richest one. I don't know what scale Amazon S3 runs at, but I'm sure it's huge; I cannot imagine how much storage S3 has. In OpenStack we have Swift, and there's also Google Cloud Storage. This is roughly how the API looks: PUT creates a bucket or writes an object, GET reads a bucket listing or an object (you can also read just the metadata using HEAD), and DELETE removes a bucket or an object. This is a simplification, because as you can see there are some headers there, and usually there are lots of headers carrying metadata or making the operation more complex. So, we talked about really large objects.
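As a sketch of how those HTTP verbs map onto bucket and object operations, here is a small illustrative helper. The operation names and the `photos`/`cat` examples are made up; real requests also carry authentication headers and signatures, which are omitted here.

```python
# Illustrative sketch of how S3-style REST operations map to HTTP
# requests -- method + path only; real requests also carry auth
# headers and a signature. All names here are hypothetical.

def s3_request(op, bucket, key=None):
    """Return (HTTP method, path) for a basic S3-style operation."""
    path = f"/{bucket}" if key is None else f"/{bucket}/{key}"
    methods = {
        "create": "PUT",     # create bucket or upload object
        "read":   "GET",     # list bucket or download object
        "stat":   "HEAD",    # metadata only, no data transfer
        "delete": "DELETE",  # remove bucket or object
    }
    return methods[op], path

# Creating a bucket vs. uploading an object differ only in the path:
print(s3_request("create", "photos"))          # ('PUT', '/photos')
print(s3_request("create", "photos", "cat"))   # ('PUT', '/photos/cat')
print(s3_request("stat", "photos", "cat"))     # ('HEAD', '/photos/cat')
```

The point of the sketch is that the whole namespace is just two levels: the bucket in the path, and optionally an object key under it.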
Has anyone here ever downloaded something from the internet, lost the connection, and had to restart the download? Everybody; it's really annoying. And objects can get really big, up to gigabyte sizes, so this can be a real problem. For that we have multipart upload. You want to upload or download an object, but you want to be able to stop it, maybe retry. So you take a big object, divide it into smaller parts, and each part is uploaded or downloaded in parallel. That way you can handle network problems: you can continue where you stopped, and you can stop and resume uploads and downloads. It's also good for streaming, for example when you don't know the size of the object you're going to upload; for a download you usually know the size, but for an upload you sometimes generate the data as you go, and for that you use multipart. It's harder for the storage side, though, because until you finish the upload we don't know what the final object will be and we cannot commit it, so we need to store all those temporary parts. And sometimes the upload fails and is restarted completely, so in the background we need to clean up all the leftover pieces. Another feature which I really like: who here has used VMS? No one? Remember VMS? It had file versioning: when you wrote to a file you got a new copy with a version suffix. I really liked it, because sometimes you delete things by mistake, and it's good to always have the previous copy. So in cloud storage we have versioning, but it's per bucket, and not all buckets are versioned; you have to enable it. When you overwrite an object it creates a new object with a different version, and if you delete the latest you still have the old version. It's really helpful when you delete or overwrite an object by mistake. Cloud object storage has really cool stuff. I didn't talk here about authentication.
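The multipart flow above can be sketched in a few lines. This is a toy in-memory model, not any real client or server API: the `store` dict stands in for the backend's temporary part storage, and the key point is that each part is independent, so any part can be retried after a network failure.

```python
# Toy sketch of the multipart-upload idea: split an object into
# fixed-size parts, "upload" each part independently (so any part
# can be retried on failure), then complete the upload by stitching
# the parts back together in order. The in-memory `store` dict
# stands in for the real backend.

PART_SIZE = 4  # bytes; real systems use multi-megabyte parts

def start_multipart(store, key):
    store[key] = {}                      # temporary parts live here

def upload_part(store, key, part_no, data):
    store[key][part_no] = data           # idempotent: safe to retry

def complete_multipart(store, key):
    # Only at completion do we know what the final object is;
    # until then the backend holds uncommitted temporary parts.
    parts = store[key]
    return b"".join(parts[n] for n in sorted(parts))

store = {}
start_multipart(store, "video")
payload = b"abcdefghij"
for i in range(0, len(payload), PART_SIZE):
    upload_part(store, "video", i // PART_SIZE, payload[i:i + PART_SIZE])
print(complete_multipart(store, "video"))  # b'abcdefghij'
```

Because `upload_part` just overwrites its slot, a part lost to a dropped connection can simply be sent again; and an abandoned upload leaves orphaned entries in `store`, which is exactly why the gateway needs background cleanup.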
Because every protocol does it completely differently. The main difference between Google Cloud Storage and S3 is authentication, and they're incompatible. Swift does it differently again. Some use a user and password, some use a user plus a key and a secret; everything is really different. So authentication is complex. And we have object lifecycle. For example, you can set a date on an object, and when that date arrives, it's called object expiration: the object can be deleted automatically or moved to colder storage. And lots of other cool features. Any questions about cloud object storage? Let's talk about Ceph. Ceph is named after a cephalopod, the family of octopuses and squid, so all our releases are named after some animal of that kind. The latest stable version is Jewel, if I remember correctly. Ceph is open source; this is our GitHub, everything is under it, all the Ceph code. And it's not only Ceph itself: we have our own testing framework, called Teuthology, which is also open, and we have additional tests, like s3-tests for S3 compatibility, and other components around Ceph. So Ceph is open source. It's software-defined storage; well, we are in the software-defined storage room. That means it can run on many kinds of platforms, and you can configure and adapt it to different topologies and scales. It's a distributed system, and like any proper distributed system it doesn't have a single point of failure. When you have a large number of nodes you cannot afford a single point of failure; we assume everything will fail. Large scale means failures happen. It's massively scalable; it was built to be large scale. People say Ceph can run on three nodes, but the minimum is really five, and I actually recommend more. It's a large-scale system, not for two or three nodes. Because we are large scale, we assume failures, so we either replicate data or use erasure coding to be more space-efficient. Because we assume disks are not reliable, we need more copies.
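The expiration idea just described can be sketched as a background scan that compares each object's expiry date against the current date. The bucket contents and dates below are made up for illustration; this is not RGW's actual lifecycle implementation.

```python
# Hedged sketch of object expiration: a background scan walks the
# objects, compares each object's expiry date against "today", and
# drops the expired ones. Object names and dates are hypothetical.

from datetime import date

def expire_objects(objects, today):
    """Remove objects whose 'expires' date has passed; return survivors."""
    return {name: meta for name, meta in objects.items()
            if meta["expires"] > today}

bucket = {
    "logs/2016-01": {"expires": date(2017, 1, 1)},
    "logs/2017-01": {"expires": date(2018, 1, 1)},
}
remaining = expire_objects(bucket, date(2017, 2, 5))
print(sorted(remaining))  # ['logs/2017-01']
```

In a real system the same scan could equally move expired objects to colder storage instead of deleting them, which is the "basic storage policy" variant mentioned later in the talk.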
It's a large system; failures happen. So we want self-healing, because you cannot handle those failures manually; everything should happen automatically. And we have unified access. There's the file system access: CephFS is a POSIX-compliant file system; you can use our FUSE client or our kernel client, and there's integration with OpenStack Manila. You can also put NFS-Ganesha in front of it for NFS, and Samba for SMB. Then there's the block interface, RBD; many know it from OpenStack. It's integrated completely with QEMU, so you can use it with QEMU for Xen or QEMU for KVM, and you can use the kernel client if you want your own block device. And we have the RADOS Gateway, the component that provides cloud object storage. Underneath it all we have RADOS, and I'm going to speak a bit about RADOS. RADOS stands for Reliable Autonomic Distributed Object Store. So as you see, at the bottom Ceph is an object store, which is interesting. When Ceph started, the goal was to build a very fast, scalable, distributed file system, and the idea was that with RADOS as a rich object store underneath, building the file system on top would be easier. But a distributed file system is never simple, even with a distributed object store underneath, and in the meantime, while CephFS was being developed, RBD and the RADOS Gateway came along; and now we have CephFS as a product too. What RADOS does is all the distributed work: it does all the replication and all the erasure coding. It's a flat object namespace. We have pools of objects, and you can configure each pool: it could be a pool of very fast storage on SSDs, or a pool of slower storage on hard disks. Pools can have different configurations: it could be three-way replication, or erasure coding. Ceph is a strongly consistent system, and it's software-defined: it's aware of the infrastructure and the topology, and you can match it to different architectures and systems.
For placement, we have a hash-based algorithm with no lookup, called CRUSH. And we want performance, so we serve the data directly from the storage nodes to the client. A bit about CRUSH, our placement algorithm. We wanted a hash-based algorithm because at large scale you don't want any lookup. Lookup requires some central authority that keeps the lookup table; you can replicate it, but it's not very efficient at large scale, it's a bottleneck. CRUSH doesn't require any lookup, and it's topology-aware. That means you can actually describe your cluster: you can say this is a rack, and I don't want to place two copies in the same rack, so it chooses nodes that are not in the same rack. You can configure the replication, you can define rules. And it's a really fast calculation, deterministic, and evenly distributed; that's the "uniform" part, because that way we can benefit from all the nodes. A Ceph cluster has two kinds of nodes. The first is the object storage daemon, the OSD. We have lots of those daemons, from tens up to tens of thousands; I think tens of thousands is a lot, maybe thousands in practice. We run one OSD per disk. You can put an OSD on a RAID, but it's a waste, because we do the replication ourselves; there's no reason to do additional replication at the storage layer. The OSD's main job is to serve objects to the user. It's a smart storage node, meaning it's active: in case of a failure, it does the recovery automatically. The OSDs talk to each other and watch over each other; if a new node comes in, they rebalance the storage. The other kind of nodes are the monitor nodes. They handle all the clustering logic and maintain the cluster membership. So if a node fails, the monitors find out because a peer OSD notices that the failed OSD is not responding and notifies a monitor of the failure, and then the monitor spreads it to the other OSDs.
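To make the "deterministic hash, no lookup table, topology-aware" idea concrete, here is a deliberately simplified stand-in. This is not the real CRUSH algorithm (which uses a CRUSH map and bucket types with weighted selection); it is only a sketch of the properties the talk describes, and the rack/OSD topology below is hypothetical.

```python
# Simplified, illustrative stand-in for CRUSH-style placement: a
# deterministic hash computation that picks N OSDs with no lookup
# table and never places two copies in the same rack. NOT the real
# CRUSH algorithm -- just a sketch of its key properties. The
# topology below is made up.

import hashlib

TOPOLOGY = {            # rack -> OSDs in that rack
    "rack1": ["osd.0", "osd.1"],
    "rack2": ["osd.2", "osd.3"],
    "rack3": ["osd.4", "osd.5"],
}

def place(obj_name, replicas=3):
    """Deterministically choose `replicas` OSDs, one per rack."""
    chosen = []
    for rack in sorted(TOPOLOGY)[:replicas]:
        osds = TOPOLOGY[rack]
        # Any client can repeat this computation -- no central lookup.
        h = int(hashlib.sha256((obj_name + rack).encode()).hexdigest(), 16)
        chosen.append(osds[h % len(osds)])
    return chosen

# Same input always yields the same placement:
print(place("myobject") == place("myobject"))  # True
```

Because placement is a pure function of the object name and the topology, clients compute it themselves and talk to the OSDs directly, which is exactly why there is no lookup bottleneck.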
And when you add an OSD, the same thing happens. We have a small number of those monitor nodes, three, five, and so on; they use Paxos. They don't serve any data to the users; they're just for the cluster membership. So we have this cool object storage cluster, and we want to access it. You can actually use RADOS directly: if you have your own application that really needs lots of performance and you're willing to interact with librados directly, you can, and some users do use RADOS directly. It has bindings for C, C++, Java, Ruby, Python, and I don't remember what else, probably more. And it's a very rich API; it's not just read/write object. You have a key-value store, we call it OMAP: inside one object you have a key-value store, and you can use it for storage. For example, the file system stores a directory in an OMAP, and as you'll see, the bucket index is an OMAP. It supports atomic single-object operations, so you can update attributes, keys, and data on an object in one operation. It has snapshots for objects. RADOS objects are mutable, which means you can actually partially overwrite objects. And we have object classes, which are very similar to stored procedures on a database: you can actually write code that will run directly on the OSD when an object is accessed. It's a really powerful feature, it enables lots of performance gains, and we use it all around the services. And we have a watch/notify service, so you can register on an object, and when the object is changed, everyone who registered gets a notification across the cluster. So it's a very powerful API, but it's not a standard API: it's not a block interface, not file, not cloud object; it's our own. But if you really care about performance and you're willing to make the effort, it can be really cool. Any questions about Ceph in general? Yes. A few words about how replication works, to answer your question. First, the objects are divided into placement groups.
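The OMAP idea, a key-value store attached to a single object, is what makes sorted bucket listing cheap later in the talk. Here is a toy in-memory model of that semantic; it is only an illustration of the concept, not the librados API.

```python
# Toy model of a RADOS object with OMAP: key-value pairs attached to
# one object, listable in sorted key order. This is a conceptual
# sketch, not the real librados interface.

class OmapObject:
    """One object with an attached key-value store (OMAP)."""
    def __init__(self):
        self.omap = {}

    def set_key(self, key, value):
        self.omap[key] = value

    def list_keys(self, max_entries=1000):
        # Listing comes back sorted by name -- which is exactly what
        # an S3 bucket listing requires.
        return sorted(self.omap)[:max_entries]

index = OmapObject()
for name in ("zebra.jpg", "apple.jpg", "mango.jpg"):
    index.set_key(name, {"size": 0})
print(index.list_keys())  # ['apple.jpg', 'mango.jpg', 'zebra.jpg']
```

This is why RGW keeps the bucket index in OMAP rather than as object data: insertions and sorted range listings stay efficient without rewriting a whole index blob.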
Objects always need to be replicated between OSDs; for each placement group, one OSD is the primary and the others hold replicas. The OSDs talk with each other, so they can detect when there's a failure: if one gets an error talking to a peer, it notifies the monitors, and then the cluster starts to handle that case. Marking an OSD as failed takes a while, but once it is marked as failed, we know all the data it contained, and that data will be replicated to other OSDs; they do it automatically. How long it takes depends on the amount of data. The maximum size of an object in RADOS is four megabytes. So four megabytes, with, say, three-way replication, means three OSDs each hold those four megabytes; if the logical object is larger, it's more complicated. Do you have to retrieve the complete object? No, you need just one copy. Yeah, you need one copy to read; to write, you need all three, if it's three-way replication. We write three times, but to read, you need one copy. Yes? Many lower-end machines versus fewer high-end ones? Of course we'd prefer many high-end machines, but the more machines you have, the more you can do at large scale. Now we're going to talk more about the RADOS Gateway, the component in Ceph that provides cloud object storage. Like all the components of Ceph, the RADOS Gateway uses librados to communicate with the RADOS cluster. So we have this cluster of OSDs and monitors, and that's where the data is stored; the RADOS Gateway is a service built on top of that. RADOS is object storage, so this should be simple: we already have object storage, so providing cloud object storage should be easy. But life is never simple. It's not just providing the REST API and mapping it to the API we have. When we compare RADOS objects to cloud objects, we see some differences. First, the maximum size of a RADOS object is four megabytes, and in the cloud we talk about really large objects.
So we'll need some way of taking bigger objects and dividing them into those four-megabyte RADOS objects. RADOS objects are mutable, but in the cloud we talk about immutable objects. The hardest part is that objects inside RADOS are not indexed, but the user wants to list all the objects in a bucket sorted by name; that means we'll need to add an index per bucket to allow this listing. And RADOS does ACLs per pool; a pool is sort of like a bucket, but we need ACLs per object. So even though RADOS is object storage, we still need to add a layer to implement cloud object storage on top of it. So we have the RADOS cluster, and we save all the data inside it. The RADOS Gateway itself is stateless, so if you see that you have lots of requests and operations and the gateway is loaded, you can just run another instance of the RADOS Gateway against the same Ceph cluster, and both can work together; that way you can scale out. So again, this is object storage: we have users and tenants, we have buckets and objects, we have metadata per bucket and per object, we have access control lists per bucket and per object. We have authentication, which is really complex because we support several protocols. Today we support the two most common cloud protocols, S3 and Swift, and in many cases you can use both at the same time. But sadly they are not fully compatible, so there are cases where it won't work. You can't upload a multipart object with Swift and expect it to behave with S3; it will mess with you, because they calculate the signatures differently. Versioning is very different; you cannot use both: a bucket versioned from S3 will not be versioned when you access it from Swift, and the other way around. The authentication of course is very different, but we do support Keystone for S3, so that's maybe okay. We also support NFS. You'll say NFS is a file protocol; yes, it is. It's not a full NFS.
If your main workload is a file system, use CephFS; this is for users who mostly use object storage and also want some NFS access. We use NFS-Ganesha on top of a library we call librgw to allow that. It's for migrating from NFS to object storage, or exporting a bit of data, and also if you have one legacy application that needs very basic NFS, you can use it. But for full-scale NFS, use CephFS or another NFS solution. We tried to build the RADOS Gateway to be very extensible. A request goes through all the layers, one after the other, and I'll go over each of them. So first we have the frontend. The frontend is what actually provides the REST API, so it needs HTTP, and we have two options. The first is the old and not-recommended way: we support FastCGI with Apache, and probably other web servers. The reason it's not recommended is that FastCGI has lots of security issues; but sometimes, if you already have your own Apache and you handle the FastCGI security issues, it can be an option. Since Hammer, I think, I don't remember exactly, we also have Civetweb, a web server embedded inside the RADOS Gateway, and that's what we recommend. Then we go to the REST layer. It's the layer that actually parses the dialect of each protocol: it parses S3 or Swift into the internal API. Then it goes to the execution layer, which is the common layer for all the protocols: we don't want to write code twice, so we try to share as much common code as possible when handling the different protocols. The next layer is the one that talks to RADOS. It's not enough to just use RADOS; we need to do some things on our own. For example, to be fast, we do the object striping and atomic overwrites, and all the bucket index handling. It also contains the object classes that run on the OSDs. And we have quotas: quota per user, quota per bucket.
Authentication also sits in the execution layer, because in many cases, at many layers, you sometimes need to authenticate again. We support AWS signature v2, AWS signature v4, Keystone v2, and Keystone v3. LDAP support is on the way; hopefully it will land at the end of Kraken, or maybe in Luminous. Objects are large, so we don't delete them and free the space immediately, because that takes time; instead, deletion moves them to garbage collection, a background process that cleans up the large objects. If I'm short on time, I'm going to skip some slides. So this is how we build objects. We take an object and split it into a head object and tail objects. All objects have a head object; it contains all the object's metadata attributes and up to 512 KB of data. Small objects only have a head; large objects are striped over several tail objects. To have fast access to objects, the object names inside RADOS look something like this. For the head, "123" is the bucket ID; it's not the bucket name, because many buckets could be named "bucket", so it's a unique ID for the bucket, and then comes the object name. That means we don't need to do any lookup in order to read the head, and since the head contains the metadata, and many operations just want the metadata fast, that helps. The tail objects are named with the bucket ID, then some UUID, and the part number. Then we have the bucket index. The bucket index is just all the objects in the bucket, sorted by name. But when we talk about cloud storage, there are cases where one bucket contains millions of objects, and then the bucket index becomes a bottleneck for performance; and also, because in the end it's stored in a RADOS object, if we keep it in one object, that's inefficient. So first, if you have lots of objects: the recommended number is up to about 100K objects for one bucket index object. If you have more than 100K, you need to use sharding: we take the bucket index and split it across several objects. Depending on the number of objects, you decide the number of shards you want.
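The head/tail layout just described can be sketched as a pair of naming functions. The exact name formats RGW uses differ from these; the formats, sizes, and the `u-1` UUID below are simplified illustrations of the scheme, not the real on-disk naming.

```python
# Illustrative sketch of the head/tail scheme: the head object's name
# is derived from the bucket ID plus the object name (so it can be
# found with no lookup), and tail objects carry the bucket ID, an
# upload UUID, and a part number. Name formats here are simplified
# and hypothetical, not RGW's actual naming.

HEAD_DATA_MAX = 512 * 1024        # head holds metadata + up to 512 KB
STRIPE_SIZE = 4 * 1024 * 1024     # RADOS objects top out at 4 MB

def head_name(bucket_id, obj_name):
    # Computable directly from the request -- no index lookup needed.
    return f"{bucket_id}_{obj_name}"

def tail_names(bucket_id, upload_uuid, obj_size):
    """Names of the tail stripes for an object of `obj_size` bytes."""
    tail_bytes = max(0, obj_size - HEAD_DATA_MAX)
    nparts = -(-tail_bytes // STRIPE_SIZE)   # ceiling division
    return [f"{bucket_id}__{upload_uuid}_{p}" for p in range(1, nparts + 1)]

print(head_name("123", "photo.jpg"))              # 123_photo.jpg
print(len(tail_names("123", "u-1", 10 * 2**20)))  # 3 tail stripes
```

A small object (under 512 KB) gets no tail objects at all, which is why a metadata read or a small-object read costs a single RADOS operation.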
We now have support for offline resharding, so in case you didn't shard the bucket and suddenly you have a large number of objects, you can reshard it. I think it's in Jewel, in the newer point releases. And we're working on online resharding, so you won't need to take the bucket offline and stop writing to it; you'll be able to reshard with I/O going on. Quotas: we are a distributed system. An object can be written from many places, through several gateways, so we need some way to coordinate the quota. We use RADOS object classes to do that: when we actually write the object on the OSD, we update the quota. That means the quota is not completely consistent at the moment we enforce it, but I think file systems have the same issue. We have metadata caches, and we use watch/notify to handle them: when metadata changes, the other gateways get a notification and can invalidate that entry from their cache. So, multi-site. We're talking about cloud, but a cloud is never one data center. Many times we want several data centers, we want disaster recovery, and we actually want to use all of them. For that we have multi-site, our geo-replication. You can take two Ceph clusters, configure them, and there will be asynchronous replication between them. It can be active-active or active-passive. The metadata operations, which are user creation, user deletion, bucket creation, bucket deletion, and a few bucket metadata operations, are synchronous, because those are really important things and we cannot have a difference in them. So we have a metadata master, and in case it fails, you need to do a failover and configure a new metadata master. But the data is completely asynchronous; that means in all the different Ceph clusters you can update the object, and it will be replicated automatically to the other clusters. A zone is what we call a Ceph cluster.
So if you have two locations, each one will have a Ceph cluster; those are two zones. A zone group is just a group of zones that share the same data. Even if they're not in the same location, they'll be in the same zone group, because they replicate from each other. We had multi-site support in Hammer, called the sync agent; it was only active-passive. But from Jewel on, everything is active-active. Okay, new stuff. We talked about object lifecycle; we don't support the full S3 object lifecycle, but we have object expiration and a basic storage policy, which means that after the object expires, we move it to a different, cold storage. We have a more efficient object copy operation: we don't actually copy the data, we just update the new object's head so it points to the same data as the original object. We have encryption, thanks to Mirantis, so objects can be encrypted. We have compression, also Mirantis's work. We support torrents. We have static websites; if anyone wants to serve a static website, we have support. That's new in Kraken. We have metadata search: you can export the metadata to Elasticsearch, and that way the user can search. And we're working on online bucket resharding. I hope I have time for questions. No? Okay, one question. You asked what the performance is of using the RADOS Gateway versus using RADOS directly. You need to understand they are very different. The gateway is a REST API; REST means HTTP, and HTTP is a slow protocol. I think that's the main performance penalty. But if you're talking about really, really large objects, then the HTTP overhead is limited. We have some measurements; we are not as fast as RADOS bench, because we have per-object overhead as well, since one cloud object is not just one RADOS object. I don't know the numbers offhand, but there is a penalty. Thank you.