You could plug into the power if you want. I think I have enough battery; it's better to use this one. Can I use this one? Yeah, you can, of course. Okay, you can try it. So this advances the slides? Yeah. And there's also a laser pointer if you want it. I can give the presenters feedback on what they said and what they did. There will be a prize at the end for them, so it will help if you vote for the best presentation. Also, don't forget about the lightning talks: you can propose topics and you can vote on topics. They're at 6:10, so if you want to participate, go for it. But for now, let me introduce Orya. She'll be talking about Ceph.

Hello. I'm sorry about my voice, it's a bit low, I caught a cold, so if you cannot hear me, just say so. No problem, we can hear you, it's okay. Okay, great. So I'm Orya, I work on the same team. I'm currently working on the RADOS Gateway, and I'm going to talk to you about it today. I'll start with what Ceph is. For those who don't know: who here knows Ceph? And who knows why Ceph is called Ceph? Yeah, it comes from "cephalopod," a family of sea animals, and that's why the Ceph symbol is an octopus. Ceph is an open source project. It's software-defined storage, a name you've all heard a lot. Who knows what software-defined storage is? Good, some of you. The idea is not just that it's software, but that it can fit different architectures: the software adapts itself to the hardware, rather than the hardware being chosen for the storage. Unlike many systems that are only software yet still match specific hardware, so you cannot install them everywhere. It is written mostly in C++. It's a distributed storage system, which I believe is the right way to build storage, because the amount of data keeps increasing all the time. Many years ago I was at a startup called Exanet, named that because an exabyte was such a huge amount of storage, and we're actually going to get there soon.
I'm guessing there will be exabyte-scale storage; petabytes are already quite common. Like every good distributed system, it doesn't have a single point of failure. It was built to be massively scalable, which means it's not for a small number of nodes; it's for a big cloud. It won't perform well if you use it on two or three nodes. It's for a really large number of nodes, and if you're not planning to get really large, you should choose another system. And because it was built for a really large number of nodes, you cannot expect the user to handle errors, because when you're talking about a thousand nodes, nodes will fail, hard disks will fail, you'll have problems in the network. Everything is going to happen. So you need self-healing, and it fixes itself. And it gives you a unified storage API, so you can access it as file, block, and object.

A little bit of architecture. The base of Ceph is RADOS, the Reliable Autonomic Distributed Object Store. That's what actually does the replication and the distribution of data. On top of it we have librados, which is an API library to access RADOS directly; it has C, C++, Java, and Python bindings. And on top of that are the three ways we access storage. We have CephFS to access it as a file system. We have RBD, which is what's commonly used in OpenStack; it's the block storage. There are two ways to use RBD: librbd, completely integrated into QEMU, and krbd, which is the kernel driver. And what I'm going to talk about today is the RADOS Gateway, which provides cloud object storage like Amazon S3.

So let's talk about RADOS. RADOS is responsible for the replication. It's an object store, but it's not like a cloud object store: it doesn't have a RESTful API. It's a flat namespace. Objects reside in pools, and pools partition the whole system. And you can actually define the placement algorithm depending on your topology.
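To make the placement idea concrete, here is a toy stand-in for the mapping: a stable hash from object name to a group of storage nodes. The real CRUSH algorithm is far more elaborate and topology-aware; every name and ID below is made up for illustration.

```python
import hashlib

def pg_for_object(obj_name: str, num_pgs: int) -> int:
    """Map an object name to a placement group with a stable hash.
    (The real CRUSH algorithm also walks the cluster topology so
    replicas land on different disks, hosts, or racks.)"""
    h = int.from_bytes(hashlib.md5(obj_name.encode()).digest()[:4], "big")
    return h % num_pgs

# Each placement group is a fixed set of OSDs holding the replicas
# (three OSDs for 3-way replication). These IDs are invented.
pg_to_osds = {0: [2, 5, 9], 1: [1, 4, 7], 2: [0, 3, 8], 3: [2, 6, 9]}

pg = pg_for_object("object-1", len(pg_to_osds))
print(pg, pg_to_osds[pg])  # same name always maps to the same group
```

Because the mapping is a pure function of the name and the cluster map, any client can compute it locally and talk to the right OSDs directly, with no central lookup.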
You can actually say that you want two copies not to reside on the same rack, because you don't want to lose both copies to a single rack failure. And the placement algorithm is hash-based, to allow really fast access. It's called CRUSH. And clients of RADOS actually use librados to access the data directly.

There are two types of nodes. The first, the OSD, is the storage node, and it's a very smart storage node. Usually in a cluster you have a large number of OSDs, between tens and tens of thousands. You define an OSD per disk, or per RAID group if you want to. Clients get the data directly from the OSDs, and the OSDs do all the replication and self-healing themselves. Then you have monitor nodes; you have a small, odd number of monitors. The reason for the odd number is that we need a quorum in case of a partition. And all the monitors do is maintain the membership of the cluster. We use a gossip protocol, so a monitor doesn't have to spread updates to all the OSDs: when an OSD interacts with a monitor, the monitor gives it the information, and the OSDs pass it around among themselves. Clients do not go through the monitors for data.

So objects are divided into pools. Pools can have different replication: you can say this pool will have three copies, that one two. You can have different placement rules for pools, and different access control. To know which OSDs an object actually lives on, we have what we call a placement group. A placement group is a group of OSDs that actually hold the data; with three-way replication, there are three OSDs in a placement group. And we use CRUSH with a hash of the object name to find which placement group the object belongs to. CRUSH is dynamic: when you add nodes or add OSDs, it adjusts the results in the right way. CRUSH is quite complex, so I think we'll skip the details. So the librados API, first of all, gives us single-object atomic transactions.
That means you can update the data and the attributes of an object in one operation, and it will be atomic. You can also have key-value storage inside an object. You have snapshots, and it allows partial overwrite of existing data, a big difference from other object stores. And we have something like stored procedures in a database; we call those RADOS classes. They allow you to provide a piece of code that will run inside the OSD when the object is written to. You can also get watch-notify events when an object is changed.

So, a little bit about cloud object storage. Who knows or has used Amazon S3? Swift, OpenStack Swift? Those are both cloud object stores; there's also Google Cloud Storage. The API is always a REST API, HTTP-based. You have users; in Swift you also have tenants, which are like a big user that allows sharing data between different users. You authenticate to the cloud: Amazon uses a key and secret, Swift a password. You can give access control on objects, so some users can write and some users can only read. You divide the data into buckets; Swift calls them containers, but it's the same thing. They're similar in some ways to pools, but a bit different; I'll talk about the difference later. And you have objects. Objects are something between a file and a block: if a block is just data, an object also has metadata and attributes. But it's still a flat namespace: you have objects in a bucket, the bucket has a name, the object has a name, and that's it, unlike a file system where you can go deeper. The main examples are S3 and Swift.

Here's a small example of how you use it. This one is S3, but Swift is quite similar. You have the API version, and it's a simple HTTP request. You use an HTTP header to pass the key. It's not quite as simple as that, because what you send is a hash, so you don't put the secret in plain text, but everything goes over the HTTP protocol.
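That header-based authentication looks roughly like the S3 v2 signing scheme: an HMAC over a canonical request string, base64-encoded, sent alongside the access key. This is a simplified sketch (the real v2 rules also canonicalize x-amz-* headers), and the credentials are invented:

```python
import base64, hmac, hashlib

def s3_v2_auth_header(access_key, secret_key, method, date, resource,
                      content_md5="", content_type=""):
    # Simplified "string to sign" in the spirit of the S3 v2 scheme.
    string_to_sign = "\n".join([method, content_md5, content_type,
                                date, resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    # The secret never travels; only the key id and the HMAC do.
    return f"AWS {access_key}:{signature}"

# Hypothetical credentials -- in RGW, user creation generates real ones.
header = s3_v2_auth_header("AKIAEXAMPLE", "secret123", "PUT",
                           "Tue, 09 Feb 2016 10:00:00 GMT",
                           "/mybucket/object1")
print(header)
```

The server recomputes the same HMAC with its stored copy of the secret and compares; a mismatch means a bad key or a tampered request.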
For example, to create a bucket, you just PUT the bucket name. To get a bucket, you use GET; to delete it, DELETE. If you create an object, the only difference is that you give two names, so we know it's an object and not a bucket, because we have just two levels. You can copy objects, read them, and delete them.

Now, why do we need a special service? Why can't we access RADOS directly to implement S3 or Swift? First of all, the RADOS Gateway keeps all its state inside the RADOS cluster. It's completely stateless, so you can actually scale it out. If you have lots of clients, you can run many RADOS Gateways against the same RADOS cluster and let them share the load. And all the data is kept in the RADOS cluster, so even if one gateway falls, you can just spin up another one, and it will read everything it needs from the RADOS cluster.

Today we have a quite simple way to deploy the RADOS Gateway: you just use ceph-deploy, and it's one line. You can also do prepare and then activate, but it's quite basic. This one line actually brings up three RADOS Gateways. To use the gateway, you also need to create a user for S3, and this will automatically generate the secret and the key that we need to access the cluster. There's a similar example for Swift; Swift has a sub-user because there's a tenant, but it's quite simple.

These are the main components of the RADOS Gateway. The idea is to make it as layered as possible, but like always, sometimes you need to skip layers; that's the way it happens. So first of all, for the RADOS Gateway we want HTTP, so we need some way to actually reach RADOS through HTTP. That's the front end; it implements HTTP. You have two options. The first is an external web server, like Apache or nginx, using FastCGI; you need to enable FastCGI in Apache and integrate with it.
Then we have an embedded web server, Civetweb, which is a bit simpler: in that case you don't need anything else, you just run the gateway and the web server runs inside it. Then we have a layer to translate the dialect, because all in all the different REST APIs are different, but they are similar, and we don't want to duplicate code. So we have a layer for each dialect that translates it, and underneath, the execution layer is common: creating an object works almost the same for Swift and for S3. This also means you can access objects through different dialects: you create an object with S3, and you can actually read it with Swift, as long as it's the same user, or the other way around.

We have a special layer we call rgw_rados that maintains the data we need in RADOS. We'll go into more detail about everything it needs, but: we support really, really large objects, because the limit is five terabytes per object, which means we need striping of objects. We need to do atomic overwrite, because cloud object storage doesn't allow writing inside an object. We need fast access to buckets, to see which objects are in a bucket, which means we need to index the bucket. And it also runs the object classes, which are like the stored procedures in the OSD. We have a quota component, with quota per user and quota per bucket. Authentication is actually more complicated, because Swift, S3, and Google all do authentication differently, so we need to handle each one and support the different schemes. And because objects are large, we want the delete operation to be efficient, so deletes are actually done in the background: the garbage collection deletes all the metadata and the objects in the background. So when you delete an object, the operation finishes immediately, but it will take time for the space to be freed.
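The dialect-translation idea mentioned above, separate S3 and Swift front ends over one common execution layer, can be sketched like this. All class names and path shapes here are illustrative, not RGW's actual internals:

```python
class ExecutionLayer:
    """Common layer: one implementation of create/read, shared by dialects."""
    def __init__(self):
        self.buckets = {}          # bucket -> {object name: data}

    def create_object(self, bucket, name, data):
        self.buckets.setdefault(bucket, {})[name] = data

    def read_object(self, bucket, name):
        return self.buckets[bucket][name]

class S3Dialect:
    def __init__(self, exec_layer): self.exec = exec_layer
    def handle(self, method, path, body=b""):
        # S3-style paths look like /bucket/object
        bucket, _, obj = path.lstrip("/").partition("/")
        if method == "PUT" and obj:
            self.exec.create_object(bucket, obj, body)
        elif method == "GET" and obj:
            return self.exec.read_object(bucket, obj)

class SwiftDialect:
    def __init__(self, exec_layer): self.exec = exec_layer
    def handle(self, method, path, body=b""):
        # Swift-style paths look like /v1/account/container/object
        _, _, container, obj = path.lstrip("/").split("/", 3)
        if method == "PUT":
            self.exec.create_object(container, obj, body)
        elif method == "GET":
            return self.exec.read_object(container, obj)

store = ExecutionLayer()
S3Dialect(store).handle("PUT", "/photos/cat.jpg", b"...bytes...")
# Created via S3, readable via Swift, because the layer below is shared.
print(SwiftDialect(store).handle("GET", "/v1/acct/photos/cat.jpg"))
```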
So we need to keep user data, bucket data, objects, and ACLs.

Why couldn't we just put up a web server and translate the REST API directly into RADOS commands? Why do we need the gateway? The main difference is that RADOS limits the object size to a few gigabytes, RADOS objects are mutable, they are not indexed inside the pool, and you don't have per-object ACLs the way cloud storage needs. Compare that to what the cloud API needs: really large objects, a few terabytes; immutable objects; objects indexed per bucket, so listing the contents of a bucket is fast; and flexible permissions.

So the first problem is large objects. How do we support really, really large objects? We want small-object access to be fast, and we want fast access to the metadata when you list the contents of a bucket or stat an object, but we still want really, really large objects. So how do we do it? An object is actually not one RADOS object; it's several. There will always be a head object. It contains all the metadata and the attributes, some of them user attributes, and it acts like a manifest; Swift also has a manifest for its large objects. For a small object, the data can live in the head, up to half a megabyte. And then you have a tail: one or more objects, or zero for a small object, holding the rest of the data, striped. Usually the stripe size is four megabytes, but you can configure it to be larger or smaller.

Now the naming. We need to identify the head object, and that's just the bucket instance ID plus the object name. We use a bucket ID because a user can rename the bucket, so we cannot use the bucket name as an identifier; each bucket has a unique ID and we use that. The tail will look quite similar: it starts with the bucket ID, but then something random, a unique ID.
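A toy version of the head/tail layout just described, assuming the defaults mentioned (up to 512 KB of data in the head, 4 MB stripes). The exact naming scheme here is illustrative, not RGW's real one:

```python
import uuid

HEAD_MAX = 512 * 1024           # small objects fit entirely in the head
STRIPE = 4 * 1024 * 1024        # default tail stripe size

def rados_objects(bucket_id: str, obj_name: str, size: int):
    """Return the list of RADOS object names backing one logical object."""
    names = [f"{bucket_id}_{obj_name}"]       # head: metadata + first bytes
    remaining = max(0, size - HEAD_MAX)
    tail_id = uuid.uuid4().hex                # random unique tail prefix
    part = 1
    while remaining > 0:                      # tail parts numbered 1, 2, ...
        names.append(f"{bucket_id}__shadow_{tail_id}_{part}")
        remaining -= STRIPE
        part += 1
    return names

print(len(rados_objects("bkt123", "small.txt", 100 * 1024)))     # head only
print(len(rados_objects("bkt123", "big.iso", 10 * 1024 * 1024))) # head + tails
```

A 100 KB object is one RADOS object; a 10 MB object becomes a head plus three 4 MB stripes, and a multi-terabyte object simply becomes more stripes.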
Then numbering marks part one, part two, and so on. To actually list which objects are in a bucket, we need something similar to a directory: a way to see the list of all the objects inside the bucket. We use the key-value storage that RADOS gives us: inside the bucket index object we keep a key-value map of the objects, and that's how we can list buckets quite quickly. Because we allow a large number of objects, at some point we need to shard the bucket index, so it will actually be several sharded objects.

Object creation. Because the object is not one RADOS object, we need to think about the order of creation: we need to create the head, create the tail objects, however many we need, and update the bucket index. It's quite similar to creating a file, and it has the same problem: when you create a file, you also need to update the directory. But in RADOS we don't have an atomic transaction across several different objects, and we still need consistency. If we, let's say, update the bucket index but don't create the head, then the user can list the objects in the bucket and see an object that doesn't exist, but won't be able to get it. And in the other case, if we create the head first but don't update the bucket index, then the user won't see the object when he lists the bucket, but can actually access it.

So we do something like a two-phase commit. We first create the tail of the object, because that's something users cannot see, and if there's some error, the garbage collection can clean it up later. Then we add an entry to the bucket index, but it's not a full entry: we mark it as prepared. Then we write the head. And only after everything is complete and we know that the head is complete, we change the index entry to complete. And then it's a regular object. This way we can handle failure at each stage.
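The two-phase creation just described can be sketched as a small simulation; the data structures are invented for illustration:

```python
class BucketIndex:
    """Toy bucket index: a key-value map with per-entry state."""
    def __init__(self):
        self.entries = {}   # object name -> "prepared" | "complete"

    def prepare(self, name): self.entries[name] = "prepared"
    def complete(self, name): self.entries[name] = "complete"
    def list_objects(self):
        # Listings only show complete entries, so a crash mid-create
        # never exposes a phantom object.
        return [n for n, s in self.entries.items() if s == "complete"]

def create_object(index, rados, name, data):
    rados[f"tail:{name}"] = data      # 1. tail first: invisible to users;
                                      #    garbage collection reaps orphans
    index.prepare(name)               # 2. index entry, marked "prepared"
    rados[f"head:{name}"] = b"meta"   # 3. head with metadata / manifest
    index.complete(name)              # 4. flip the entry to "complete"

idx, store = BucketIndex(), {}
create_object(idx, store, "obj1", b"payload")
print(idx.list_objects())
```

If any step fails, the object is either absent or marked prepared, and cleanup can proceed in the background, which is the consistency the talk describes without a cross-object transaction.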
So we're talking about a distributed system. We have quota per user and quota per bucket. That means every time a user writes an object, we need to update its quota, and the write could come from any gateway in the cluster, and we have several. So if each gateway updated the quota itself, we'd need some locking. Here come the RADOS classes: instead of updating the quota in the gateway every time an object is written, it's done in the OSD. When you write to the object, the OSD runs the class code, and the code updates the quota, so when we read the quota, it's already updated. This way we don't need to do any locking; it's completely distributed and it can scale.

To make things a bit more complicated, we also need performance. User data is accessed all the time: every request does some authentication, because we need to see whether the user has permission and whether the key and secret are okay. And the bucket metadata is accessed every time you add an object or even do a listing. So we have a cache layer; each RADOS Gateway has its own cache. But again, we are distributed, so we need some way to invalidate the caches. Luckily, RADOS has a watch-notify mechanism. That means that for every piece of metadata we hold in the cache, we register a watch on it, and every time it's updated there will be an event, and the gateway can invalidate and refresh it.

So, multi-site. Currently, multi-site is what we call the geo-replication of the RADOS Gateway. We allow two different geographical zones, regions, we call them regions, to be far apart, and you can actually write in one and see the changes in the other, for disaster recovery. We call it multi-site. This is the current implementation. So a region is a logical geographical place; you have east and west, for example.
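The watch-notify cache invalidation described a moment ago can be sketched as a toy pub-sub among gateways; all classes here are invented for illustration:

```python
class MetadataObject:
    """Toy RADOS object holding metadata, with watch/notify."""
    def __init__(self, value):
        self.value = value
        self.watchers = []            # registered gateway callbacks

    def watch(self, callback): self.watchers.append(callback)
    def update(self, value):
        self.value = value
        for notify in self.watchers:  # fan out the invalidation event
            notify()

class Gateway:
    def __init__(self, meta):
        self.meta = meta
        self.cache = meta.value       # local per-gateway cache
        meta.watch(self.invalidate)
    def invalidate(self): self.cache = None
    def read(self):
        if self.cache is None:        # cache miss -> re-read from "RADOS"
            self.cache = self.meta.value
        return self.cache

meta = MetadataObject({"quota": 100})
gw1, gw2 = Gateway(meta), Gateway(meta)
meta.update({"quota": 200})           # e.g. one gateway wrote new metadata
print(gw2.read())                     # the other sees the fresh value
```

No gateway ever takes a lock: the write lands in one place, and the event stream keeps every cache honest.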
The regions are different because they are different Ceph clusters, and in each region we can have one or more zones. In this example, we have one region that has one zone and a different region that also has one zone, and you have replication from east to west. And you can actually add a third region. In every setup we have one master region; this is the region that holds all the metadata. And in each region we have one master zone; this is the active zone. If you have several other zones in the same region, they will be replicated, but they will be read-only; they will be passive.

Let's talk about how you actually set this up. You need to create a region and say whether it's the master; it's all done in JSON. And you need to update what we call the region map, create a zone as well, and update the zone map. And you need to create a user for each zone and update all the zones. And then we start the gateway. That's the way it's done today. It works fine, but it's complicated. So what we're doing now is trying to simplify everything into simpler commands. It's work in progress, so I can't show it yet, but it will be much simpler.

Today we have the sync agent. It's an external utility, written in Python, and it does the synchronization between the different regions. Say we need to sync between site 1 and site 2: the sync agent reads the metadata from site 1 and sends the metadata to site 2. For the data, it actually tells site 2, please read the data from site 1 and sync with it. The first time there's a full sync, and then it continues with incremental sync whenever you have updates. In this architecture we can only support active-passive, because of the way it was built. Has anyone used the sync agent? No? So the sync agent today works fine, but it only supports active-passive.
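The sync agent's behavior, one full sync followed by incremental syncs driven by a log of updates, can be sketched like this. It's purely illustrative; the real agent is a separate utility speaking the RGW REST APIs:

```python
def full_sync(src, dst):
    """First pass: copy everything from the master site once."""
    dst.clear()
    dst.update(src)

def incremental_sync(src, dst, oplog):
    """Later passes: replay only the names that changed."""
    for name in oplog:
        if name in src:
            dst[name] = src[name]     # new or updated object
        else:
            dst.pop(name, None)       # deletion is replicated too
    oplog.clear()

site1 = {"obj1": b"a", "obj2": b"b"}  # active (master) site
site2 = {}                            # passive site
full_sync(site1, site2)

site1["obj3"] = b"c"
del site1["obj1"]
incremental_sync(site1, site2, ["obj3", "obj1"])
print(sorted(site2))
```

Because changes flow strictly from site 1 to site 2, the scheme is inherently active-passive, which is exactly the limitation the in-gateway rewrite aims to remove.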
We actually want active-active. An external utility cannot handle that, so we are actually rewriting everything inside the gateway. This will hopefully be a tech preview in the next version, Jewel, but I still don't know. So that's the problem with active-passive.

So, what's next? What do we need to add? We need multi-tenancy. Swift has a notion of a tenant: a tenant is like a big user, a super-user, and all the users under the same tenant can share buckets between them. So we're adding multi-tenant support. We're adding object expiration: you can give an object a date and time, and after that time, the object is deleted automatically. This way you can save space on data you only need for a while. We're going to support AWS v4 signatures in the next version.

NFS. Users requested NFS on RGW, which is not a great fit, because RGW is object storage; it's not a file system. So to implement NFS, we chose to use NFS-Ganesha, which there will be a talk about later today. But NFS-Ganesha needs some API to talk to the gateway, and those APIs should be file-based and allow file operations; it should look like a file system. So we created librgw. It's a library, we could say it's similar to librbd, so you can actually talk to RGW directly, but it has an API that looks like a file and directory API. You can do lookup, you can do readdir; it's all emulated on top of the object store. And this way, Ganesha can actually talk through librgw to the gateway. It's one process; it's a library, so you link it into Ganesha. And this way you'll have NFS by using RGW.

Static websites: we're adding static website support, with domain support. Keystone v3: OpenStack has a new version of Keystone we need to integrate with, so there will be Keystone v3 support. And we're going to support Swift large objects: Swift has a limit of a few gigabytes per object, if I remember correctly, and Swift large objects are the way it stripes objects in order to support larger ones.
We're going to support both static large objects and dynamic large objects. And we're reimplementing multi-site: there will be a new implementation, with, hopefully, a simpler configuration. There was an email a few weeks ago showing an example; if you want to look, you'll see the new API. It will be much simpler, no JSON. We were also asked to rename region to zone group. And it will have active-active support, because it will run inside the gateway. And that's it. Any questions?

Yes? It will be in Jewel, I hope so. Yes, it's a very short time, but it should make it; it's in review. Any more questions? If not, you can find me afterwards and ask about other Ceph stuff. Thank you.
Where are the recordings? On YouTube: if you go to the Red Hat Czech YouTube channel, you'll see the streams, and the recordings after them. It's actually an unusual kind of stream, because it's one video, one stream, for all the rooms at the same time; on a mobile you can't switch between rooms.