Oh, that's fine. So this is Marco, Marco Karg. He's going to talk to us about Ceph. Marco is a software maintenance engineer at Red Hat.

Thank you, Alex. Thanks for the welcome. I'm going to talk about the CRUSH map and, basically, placement groups today, because a while ago I was supposed to do some Ceph support and I had to get my head around that. And it wasn't easy. So I thought maybe people might benefit from my work. Hopefully you do.

First, a disclaimer: I made this talk easy in a couple of aspects, mostly by not talking about replication. It doesn't affect the basic principle, so Ceph experts, please forgive me for that. Who of you has been dealing with Ceph already? Who is running Ceph? Who thinks they understood the CRUSH map?

CRUSH actually stands for Controlled Replication Under Scalable Hashing, which is a beast of an acronym, in my opinion. Basically, it means the Ceph clients and the daemons both run the calculations of the algorithm: where to put stuff, how to retrieve it. It also means there is no centralized lookup table that we would have to spread around continuously. Due to that, CRUSH enables proper scaling and equal distribution of data over all OSDs. OSDs — that's clear to everybody? Basically, that's a disk in Ceph terminology. It also uses a sort of intelligent data replication, so that we get resilience.

Ceph contains, or provides, a couple of maps, actually five of them. There's the monitor map, which gives us data about the monitors; the OSD map; the PG map, the placement group map; the actual CRUSH map, which is very, very interesting; and the MDS map, which provides data for CephFS and which I will not cover at all today.

The monitor map has the classic fsid, basically a cluster ID. There are ranks, or positions, for the monitors. Of course you need a name, an address, a port, and an epoch to define the most current version.

Then we have the OSD map.
That gives us information about the underlying disks: a list of them, their status, their number, the replica size, and, most important, also the PG number. We'll see later how that all correlates.

Then the very important PG map, which defines the placement groups: which OSDs there are, which OSDs are up and running, the current state, and also some data usage statistics, which we will not look into.

Then there is the CRUSH map. The CRUSH map, if you actually look at it, is very easy: just a list of storage devices, some hierarchy information — like a rack, a device, a host, a geolocation — and the rules to traverse that hierarchy when accessing or retrieving an object. That's basically it. The MDS map is not interesting for us today.

The challenge we face here is this: let's assume we have a couple of thousand, or 1,000, OSDs — disks. We want to distribute objects equally across them. We want to retrieve them from there. We want to make sure we can live with a disk failure. We want to be able to extend the cluster by adding disks. And we want to make sure data is distributed evenly across all those OSDs. So how do we do that?

The first naive approach would be round robin. Pretty easy: the first object goes on the first disk, the second object goes on the second disk, and so forth. That works pretty well when it comes to storing objects. But — and there is always a but — what happens if one of the OSDs fails? OSD 5 just failed, and access is no longer possible. You might say that's not a biggie. But what if the client uses the algorithm to calculate the location of object number five? They will end up with OSD 5, which just vanished. Replication? Yeah, right — but that's not the problem here.

Second idea: we use a mathematical operation like modulo. Basically that means we have five OSDs again and 100 objects to store. Where does, for example, object seven go? We take seven modulo five, which gives us a remainder of two, so that object goes to OSD 2. Same for object number three, for example.
It goes to OSD 3. And so on. That provides us with a very simple mechanism to evenly spread data across the OSDs, retrieve it, and of course store it. Mathematically, the background is just the modulo operation. And that is what the distribution would look like: object one goes to OSD 1, two goes to 2, three to 3, and so forth.

So right now it looks like we've found a solution to the problem. Still, what happens if we extend the cluster by just one OSD? Every object from five onwards would have to be relocated, just to make sure we can really retrieve it again with the same mechanism. And we would have to give the new number of OSDs to the clients, because they now need to deal with six instead of five OSDs. Not really practical. When is the best time to change the number of OSDs and hand it out to the clients? What happens if we add yet another one? We would have to relocate all the objects, which means huge data movement within the Ceph cluster. All of these objects would have to be moved around, going up to 100.

So: we had round robin — not the best idea. We used modulo — better, but it still has issues when we add disks, and the failing-disk problem isn't solved either. So what can we do about that? What do you do when you don't know what to do? Well, you add another layer of abstraction. In this case, it's called placement groups.

Placement groups: again, we use them to store data, and again we use the modulo operation, so that approach wasn't bad at all. Object 130 would go to, say, placement group 10; objects 71, 211, and so on are mapped the same way. Placement groups are just an abstract concept; there is nothing like a real placement group you could touch. And that's the huge advantage: they are constant in number. We don't change the number of placement groups, while the OSD number may change — may get bigger, may get smaller, disks fail, whatever.
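To make the relocation problem concrete, here is a tiny sketch — illustrative only, not Ceph code, with made-up counts — that measures how many of 100 objects the naive "object id modulo number of OSDs" scheme would move when the cluster grows from 5 to 6 OSDs, and why a fixed placement group count avoids that on the client side:

```python
# Sketch (illustrative, not Ceph code): cost of naive modulo placement
# when one OSD is added to a five-OSD cluster.

NUM_OBJECTS = 100

def naive_osd(obj_id: int, num_osds: int) -> int:
    # Naive placement: object id modulo number of OSDs.
    return obj_id % num_osds

moved = sum(1 for obj in range(NUM_OBJECTS)
            if naive_osd(obj, 5) != naive_osd(obj, 6))
print(f"{moved} of {NUM_OBJECTS} objects must move")  # 80 of 100

# With placement groups, the client-facing formula uses the PG count,
# which is fixed for the pool, so adding OSDs changes nothing here:
NUM_PGS = 128  # illustrative; chosen when the pool is created

def pg_for(obj_id: int) -> int:
    return obj_id % NUM_PGS
```

Most of the objects would have to move under the naive scheme, which is the "huge data movement" mentioned above; the object-to-PG mapping, by contrast, never changes.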
So now we need to make sure we access the data via the placement groups. How does that work? Well, above the placement groups, we define the pools. You've probably all seen pools in Ceph. A pool is basically the thing the client actually talks to. You might have one pool for block storage, one for object storage — you can have as many pools as you want. But those pools always have a fixed number of placement groups, and that's the big advantage here: the client accesses, say, the block storage pool, and all the client knows is how many placement groups there are for this pool. The information about the OSDs is completely hidden from the client. So if we add an OSD, from the client's perspective nothing changes. If we remove an OSD, the client doesn't even notice it.

The link between the placement groups and the OSDs is provided by a plain list of all PGs, and all OSDs have this list of placement groups. So every OSD basically knows to which placement groups it belongs. If a new OSD is attached, it just gets that list and becomes available to the placement groups — depending on your replica count, maybe it's needed, maybe it's not. Thing is, if you have ever seen a `ceph -s` output that says a pool is degraded, it's probably lacking actual OSDs. So now we have a way to handle changes in the backend without any effect on the clients. The list of PGs is the PG map we saw earlier.

So how does the CRUSH lookup actually work — how does the client find the object? First of all, the client contacts a monitor just to get a copy of the entire cluster map, all of the maps. Then the client creates a hash of the object it's looking for; typically it's a hash of the name. With that hash and the number of PGs, which we know, we get the placement group where the object lives.
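The client-side part of that can be sketched in a few lines. This is a toy model, not Ceph's real implementation — the pool names and PG counts are invented, and the hash is a stand-in (Ceph actually uses an rjenkins hash of the object name) — but it shows the key point: only the pool's fixed PG count enters the formula, never the OSD count.

```python
import hashlib

# Invented pools with fixed PG counts; all the client needs to know.
POOL_PG_NUM = {"block-pool": 256, "object-pool": 128}

def pg_of(pool: str, object_name: str) -> int:
    # Stand-in hash of the object name (not Ceph's rjenkins hash).
    obj_hash = int.from_bytes(
        hashlib.md5(object_name.encode()).digest()[:4], "little")
    # Hash modulo the pool's fixed PG count gives the placement group.
    return obj_hash % POOL_PG_NUM[pool]

# Adding or removing an OSD changes neither POOL_PG_NUM nor this result.
print(pg_of("block-pool", "my-image-chunk-0001"))
```

Because nothing about the OSDs appears here, backend changes are invisible to the client, exactly as described above.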
Then we contact the OSD directly and retrieve the object; the OSD — or rather the Ceph cluster — in the meantime computes, with a CRUSH lookup, the secondary placement group and the OSDs. I know it sounds complicated; it's not.

This is the CRUSH lookup pseudocode. As you can see, first the client retrieves the cluster map, then builds the hash. With that object hash and the number of placement groups, we get the actual placement group the object lives on. Then, with that placement group, we do the CRUSH lookup and get the OSDs — primary and secondary, maybe a tertiary — where the object can be found.

Let's do an example. Let's assume we want to look up an object called "swim ring". We would take a hash of "swim ring", modulo the PG number.

[Question from the audience.] Sure. Yeah, that's defined in the CRUSH map, where you have this hierarchy information and where you assign OSDs to different computers, different racks, and so on. So you make sure those two OSDs are not on the same node, and Ceph takes care of that.

Right, back to the lookup. With that hash and the number of PGs we have, we get, let's say, the hex value 0x34. Then we need to know the pool ID, which we get from the PG map again. That pool ID, combined with the placement group, gives us the locator — 5.34 in this case — and with 5.34 we do the final CRUSH lookup, which in turn gives us OSD 20 as the primary, OSD 46 as the secondary, and maybe even a third one, OSD 59, just as an example. That's the entire lookup, and as you can see, it is rather easy, reproducible, and reliable.

So what's so special about that lookup? All we need for the entire lookup is the object name and the cluster map, so we can deterministically look up the object location. There is no shared metadata, which prevents us from running into any bottlenecks and, of course, gives us great performance.
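The whole "swim ring" walk-through can be re-created as a toy program. To be clear about what's borrowed and what's invented: the pool ID 5 comes from the talk's example, the PG count and OSD count are made up, and the `toy_hash`-based OSD selection is only a stand-in for real CRUSH, which walks the CRUSH map hierarchy with rjenkins hashing — but the essential property, a fully deterministic lookup from nothing but the object name, is the same.

```python
import hashlib

POOL_ID = 5      # pool ID from the talk's example
PG_NUM = 256     # illustrative pool setting
NUM_OSDS = 64    # illustrative cluster size

def toy_hash(s: str) -> int:
    # Stand-in hash; real CRUSH uses rjenkins.
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")

def locate(object_name: str, replicas: int = 3):
    pg = toy_hash(object_name) % PG_NUM      # hash mod pg_num -> PG
    pgid = f"{POOL_ID}.{pg:x}"               # locator "pool.pg", e.g. "5.34"
    # Toy stand-in for the CRUSH step: deterministically pick
    # `replicas` distinct OSDs for this PG.
    osds, attempt = [], 0
    while len(osds) < replicas:
        osd = toy_hash(f"{pgid}/{attempt}") % NUM_OSDS
        if osd not in osds:
            osds.append(osd)
        attempt += 1
    return pgid, osds

# Same input always yields the same answer -- no shared lookup table.
print(locate("swim ring"))
```

Running `locate("swim ring")` twice gives identical results, which is exactly why no centralized metadata service is needed: any client with the cluster map computes the same primary and replica OSDs.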
The calculation is mostly — mostly — done on the client, except for the CRUSH lookup, so we don't have any impact on the OSDs. And that cluster map is retrieved — or, well, before I do a lookup, I check whether I as a client have the current cluster map, using the epoch. So the cluster map continuously reflects the available components, down OSDs included, and can easily react to failed components and added capacity. OSDs die, we all know that, but clusters also tend to grow over time, so we get more OSDs, we get new pools, and this makes it really easy, basically, to do the lookup.

References — where is this all taken from? Basically from Sage Weil's thesis and the paper on the CRUSH lookup, and I guess that's about it. So it's not as complicated as you thought, hopefully.

Sure, go ahead. The new OSD will only — oh, I'm sorry, the question is: what happens if the cluster is still busy redistributing data to the new OSD, and the client already has the new OSD? Well, that is not possible, because the epoch has to be the same — you have the same information on the cluster and on the client — and the cluster will only change the map once the OSD is up, running, filled with data, and usable. Only then will the CRUSH map be updated. Oh, well, yeah — the PG map.

The question is: what happens if a disk fails, when does the update of the epoch happen? Ceph will detect that the disk has failed and will immediately update the maps accordingly, so that the client — if the primary copy of the object lives there — does not even go there, but takes the secondary instance. And once that disk is replaced, or another OSD is allocated and backfilled — that's what Ceph calls it — it will again change the maps and hand that OSD out to the clients.

That's a good question, I really don't know. But apparently Ceph somehow takes care of it, I don't know. No more questions? Then thank you very much.