All right. Welcome, everybody, to the last session in this room, and we're going to hear about every file system's worst nightmare. Give it up for Alexandre and... Alexandre? No, Romain. Thank you.

So, I'm Romain and I'll be presenting this talk with Alexandre, who is here. We've been working on object storage for a few years at OVH. The subject of today is storing a lot of small objects in your Swift cluster, and the way we optimized it. If we are talking about optimization, it means we had some issues. What were they? First, the most obvious: we had performance issues. We saw that especially on latency when a user requested an object. If they get it in 30 milliseconds, that's okay. If it takes hundreds of milliseconds, or even seconds, that's not good. We also had issues with the replication and reconstruction processes in Swift. For example, if you replace a failing hard drive, you need to rebuild the data on it, and that was very, very slow. And finally, we observed that our disks were always 100% busy, which, if you think about it, is the root cause of the two points above.

First of all, I'm going to explain quickly how data is stored in a Swift cluster. There are two ways of storing data in a Swift cluster: the first one is replication, and the second one is erasure coding. Replication is pretty simple: it's like RAID 1 across your servers. If you upload one object, in this example 6 bytes, it's going to be written multiple times in your cluster. At the top, I show an example of the way data is stored on a server. Swift stores each object as a file on an XFS file system. XFS is the recommended file system; it can work with some others. The first part of the path is the mount point, so it's all under /srv/node, and then the device. After that, starting from the right, the file name is based on a timestamp, the time when you uploaded the object. Then you have a hash, which is kind of an ID for the object; it's based on the name of the object and some other information, like the account and container. And you also have the partition and the suffix, which are derived from the hash. For those of you who know Ceph, a partition is like a placement group, and a suffix is just a slice of a partition.

So, that's the first way: replication, like RAID 1. The second one, erasure coding, is like RAID 5. When you upload an object, the same example with 6 bytes, it's going to be split into fragments. And to get the redundancy you expect from your Swift cluster, we add more fragments, parity fragments. In this example, there are three data fragments and one parity fragment. You can choose the numbers you want; in production at OVH, we run with 12 data fragments and three parity fragments.

So, a quick comparison. With replicas, you get performance, because you need only one connection to access your data on an object server, and if you have multiple replicas, you can open connections to several servers. But you pay the overhead: with three replicas, you store 12 bytes for one object of 4 bytes. Erasure coding is cost-effective, but it's kind of slow, because you need to open many connections and then do the mathematical operations to rebuild the object. In our production, we have both: three replicas on some policies, and 12 plus 3 fragments on others, as I told you before.
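To make that comparison concrete, here is a tiny illustrative sketch of the two overhead figures quoted above; it is just arithmetic, not Swift code:

```python
# Illustrative arithmetic only: the overhead of the two schemes above.
def replica_overhead(replicas: int) -> float:
    return float(replicas)                  # 3 replicas -> 3x on disk

def ec_overhead(data_frags: int, parity_frags: int) -> float:
    return (data_frags + parity_frags) / data_frags

print(replica_overhead(3))   # 3.0  -> a 4-byte object occupies 12 bytes
print(ec_overhead(12, 3))    # 1.25 -> cheaper, but 15 files per object
```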
So, in the cluster, a replicated object means three files, while with erasure coding it means 15 files per object. You already see a difference: five times more files when using erasure coding. And this is where we start talking about inodes. On XFS, and in fact on almost all file systems, one file means one inode. Also, one directory means one inode. So, if you think back to the way data is stored on the file system, we have one inode for the file and one inode for the object directory, the one named after the hash. So, with replicas, we already have six inodes per object in the cluster, and with erasure coding, 30 inodes per object.

An inode is very useful to get your data back. It contains information like the position of the data on disk, but also everything you use to manage files on a server: creation, modification, and access times, the owner of the file or directory, the permissions on it. It can also carry ACLs, quotas, extended attributes, a lot of things. The thing is, in Swift, we don't need that. We already have the creation time: it's in the file name. The owner is always the Swift process. And the permissions, well, there are no permissions in Swift, not at the file system level anyway; there are permissions higher up the stack, but not here. So we don't really need all this, yet we have a lot of inodes. And this is where we started to have issues.

One inode on XFS takes about 300 bytes to one kilobyte of memory when it's in the cache. And we have an average of 2.4 inodes per object or fragment: one for the data file, one for the object directory, and the partition and suffix directories, which are shared, amortize to about 0.4. Our production servers run with 64 gigabytes of memory. We have 36 disks, and on each disk, we have 70 million inodes. 70 million. I'll let you do the calculation, but it does not fit in 64 gigabytes of memory, for sure; it's more like one terabyte, even more. And we didn't want to buy that amount of memory. We have thousands of servers; it would cost so much money.

So, yeah, we had memory issues: the inodes cannot fit in memory. And the thing is, when you access a file, you need to access every inode along the path, just to check the permissions, for example. In practice, only the top-level directories were fitting in the inode cache. When we looked at the XFS stats, only 20% of inode accesses were hitting the cache, which means 80% were fetching from disk. And fetching an inode from disk, well, that's an IOPS. One or many IOPS. So, we had half of the devices' IOPS capacity used only to fetch inodes from disk. That's not what we want: the user wants the data; they don't care about the inodes. So, yeah, it was slow, really slow.

But we also had some stability issues. To be fair, it was on an older kernel; at that time, we were running 3.14, and things have improved since then. But we had a lot of file system corruptions, and we were totally unable to repair them on the production servers, because when you run xfs_repair, it allocates about one kilobyte of memory per inode. So, xfs_repair could not run on 64 gigabytes of memory, and in production, the memory is used by other processes than xfs_repair. So, we had one server with a lot of memory just to run xfs_repair. We used to take the disks, put them in this server, run xfs_repair, and put the disks back into production. And one xfs_repair run was taking about two days. So, yeah, we had an issue. The good news is that Swift is open source. So, hey, why not? We will fix it. We don't need inodes.
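For reference, this is the back-of-the-envelope arithmetic behind that "more like one terabyte" estimate, using only the figures quoted above:

```python
# Back-of-the-envelope inode cache sizing, using the figures from the talk.
inodes_per_disk = 70_000_000
disks_per_server = 36
total = inodes_per_disk * disks_per_server        # 2.52 billion inodes

for per_inode in (300, 1024):                     # bytes per cached XFS inode
    print(f"{total * per_inode / 2**40:.2f} TiB at {per_inode} B/inode")
# -> 0.69 TiB at 300 B/inode, 2.35 TiB at 1024 B/inode: far beyond 64 GiB.
```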
We need data. That's all. So, we tried a lot of things. A lot of crazy things, actually, before finding the right solution. First, we thought of storing objects in a key-value store, like RocksDB or LevelDB. We quickly found out it was not well suited: first, we would need synchronous I/O, and with this kind of solution, there is what is called write amplification; it writes much more data than what you actually put into the key-value store. So, no. Not a good idea.

After that, we thought we could store the file handle of each data file in a key-value store. A file handle is like a direct pointer to the inode of the file, without needing to walk every inode along the path; it's like a unique ID on the file system. So, we could take it and access the inode of the data file directly. In terms of performance, it's quite good, but the issue is that the file system has its own structure, we had this key-value store next to it, and it was really complicated to keep them in sync in case of a crash, for example. What do we do if, after creating a file, the machine crashes right before we record it in the key-value store? So, we didn't follow this idea.

Then we thought, hey, we will patch XFS. Well, the XFS developers did a pretty good job; it's already well optimized, and there was not much to gain there. We also looked at ZFS, which is built on a layer called the DMU, which is actually an object store. So, we thought we could put our objects in that object store. It would bring us a lot of cool features, like snapshots, cloning objects, things like that. But it has performance issues when the file system gets full, a well-known ZFS issue. And it would have been really low-level development: there is no stable API at that layer in ZFS, so it could break at any upgrade. So, we decided not to follow this idea either.

So, we thought: we don't need inodes, we don't need files. What do we do? One obvious idea: if you want fewer inodes, you should have fewer files. So, we ended up doing this, where we store objects contiguously in larger files. You can see we have a small object header with the file name and metadata information, and then, in green, the actual file content. These large files, we will call them volumes from now on, but they are just regular files on an XFS file system. Nothing special about them, except that they are larger.

So, let's see how we work with these. A volume is dedicated to a partition, and I mean a Swift partition, which, like Romain said, is something like a Ceph placement group. We only ever append data to a volume: we write new objects at the end, and we never, ever overwrite anything. So, it's kind of like a journal. And because it's append-only, you cannot write to it concurrently. If, on one server, you need to write two objects at the same time that should go to the same partition, then you write to two distinct volumes, two distinct files. That's the basic idea.

Before we dig further, I want to remind you very quickly how Swift works, without talking about authentication or container servers. I just want to say that you cannot contact the object servers directly. Your request goes to a proxy server, which will, in this example, send three copies of your data to three different object servers. And if you need to get your data back, the proxy server will fetch your object from any one of these three. Our patch only touches the object server; we didn't change any other Swift code.
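To illustrate the append-only volume writes described above, here is a minimal sketch. The talk does not detail the actual on-disk header format, so the length-prefixed layout and the append_object helper below are hypothetical, and the 4 KiB alignment discussed later is omitted:

```python
# Illustrative sketch of an append-only volume write. The real patch's
# on-disk header format is not given in the talk; this layout is made up.
import os
import struct

HEADER = struct.Struct(">HHI")   # name_len, meta_len, data_len (hypothetical)

def append_object(volume, name: bytes, metadata: bytes, data: bytes) -> int:
    """Append one object record to an open volume file; return its offset."""
    offset = volume.seek(0, os.SEEK_END)       # append-only: never overwrite
    volume.write(HEADER.pack(len(name), len(metadata), len(data)))
    volume.write(name + metadata + data)
    volume.flush()
    os.fdatasync(volume.fileno())              # persist before acknowledging
    return offset                              # to be registered in the index
```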
So, how does Swift organize data on the object server? Romain touched on this earlier. When you send an object, and I'm simplifying a bit, Swift calculates a hash based on the object name and a few other things. So, you get the object hash, its ID in the cluster. From the hash, you compute the partition. How do you do that? You take a few bits from the beginning of the hash, and how many is operator-configurable. You interpret these bits as an integer, and that gives you the partition. That's really important, because the Swift topology is described by the ring, and the ring tells you that this partition goes to this, this, and this object server. So, we need this. Then, the suffix is just the last three characters of the MD5 hash, as ASCII. We will see why we need it. And finally, the file itself is named with a timestamp, the time when you put the data, plus the .data extension.

Eventually, we get this path, which Romain showed you already. "objects" is the root directory containing all the files for a given Swift policy. Then comes the partition directory. The suffix directory is there so that you don't get too many directory entries right below the partition; it's kind of artificial, just to avoid having too many entries there. Then the object hash is another directory, and finally, we have our file. So, you can see how this may cause problems if you have many, many small files: too many entries.

So, how does this work with our new system? You can see a new component here, the index server. Actually, it's not a new server in the sense of a new machine; it's a process that runs on the object server machine, alongside the object server processes. You have one index server per disk, per policy. On the left, nothing changes: you come with your data to the proxy server, which contacts the object server. And that's where the patch applies. Instead of creating a single file, it tries to find a volume available for the given partition, takes a write lock, and writes the data at the end of the volume. Append-only, always. Then it syncs the data, with fsync, or actually fdatasync in this case. And finally, it registers the object in the index server. What do I mean by that? We wrote our file somewhere in a volume; we need to be able to quickly retrieve that location later. So, we send the index server the name of the object and its location: the volume, and the offset where the object can be found in that volume. If no volume exists for the partition, one is created. If there is only one and another process is writing to it, we can create another one.

Reading an object is easy. The proxy server again contacts the object server, which asks the index server for the location of our object. We get the volume number and the offset within the volume, and we can then just read it.
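As a side note, here is a simplified sketch of that hash-to-partition-and-suffix derivation. Real Swift salts the hash with a per-cluster suffix and takes the partition bit count, the "part power", from the ring; both are simplified away here:

```python
# Simplified sketch of deriving the partition and suffix from an object
# hash; the per-cluster hash salt and the ring lookup are omitted.
import hashlib

def object_hash(account: str, container: str, obj: str) -> str:
    return hashlib.md5(f"/{account}/{container}/{obj}".encode()).hexdigest()

def partition(hexhash: str, part_power: int = 18) -> int:
    # First `part_power` bits of the 128-bit hash, read as an integer.
    return int(hexhash, 16) >> (128 - part_power)

def suffix(hexhash: str) -> str:
    return hexhash[-3:]          # the last three hex characters

h = object_hash("AUTH_demo", "photos", "cat.jpg")
print(h, partition(h), suffix(h))
```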
Now, let's zoom in on that new index server component. It's a gRPC server, written in Go. Again, it runs alongside the object server, in the same failure domain: it serves a single disk, basically. So, if you have a machine with 36 drives, you get several instances of it. It stores its data in a key-value store; we're using LevelDB. There are two important characteristics of LevelDB for us. The first one is that it orders entries by key; that's really important for us, and we will see why in a minute. The other interesting property is that we use it with the Snappy compression algorithm, so it makes for a small database. And we want it to stay in memory; we trust the system, the page cache, to keep it there if it's small enough.

The key in this system is a concatenation of two strings: the object hash and the file name. And the value is the object location: the volume index, because the volumes are named with their index in them, and the offset within the volume. Let's take an example. Here, I have three entries in my data store, all for different objects. On the left side, in color, you can see the hash, and after that, the file name. That's enough to retrieve our object: this entry was written when we wrote the data, and if we want to read the object again, the object server, the Python code, contacts the index server, asks for the location, and can then open it.

However, that's not enough, because Swift does rely on the directory structure that Romain showed you, with the partition and the suffix. For example, replication or auditor jobs will want to walk an entire partition, or a suffix within that partition. We don't store those, but we can compute them. For example, if I need to give a list of partitions to a caller, I can walk the entire store, take the first bits of each hash, and compute the partition number. So, you can see that the first two objects land in the same partition, while the third one is in another partition. Same for the suffix; that's even easier: we just take the last three characters of the hash. So, we get that directory level; below it, you find the full hash, and eventually, the files themselves. That works even with multiple files per object: in Swift, if you add metadata to an existing object, you do a POST, and that translates to a new .meta file within that directory. That works with this scheme too; you simply get two files in the same directory.

So, we can write files, we can read them, we can compute the whole directory hierarchy. There's still something missing: sometimes, people like to delete their objects. We handle this with hole punching. I don't know if many of you are familiar with hole punching. No? A few? Yeah. The idea is quite simple, and it's a great feature that some file systems offer, like XFS, and I think ext4, and probably a few others. How does it work? Say you have a one-megabyte file with unrelated data inside, and you decide you want to discard some of it, from offset, say, 200 K to 300 K: I want to discard 100 kilobytes. You can use a system call, fallocate, with some flags, which takes these blocks within your file, frees them, and returns them to the file system. You will see free space go up in your file system. And the great thing is that the file layout does not change: if you do ls -l, you will still see one megabyte, so all the offsets we stored in the index server are still correct. If you do du, you will see 900 kilobytes, because you freed 100. That might freak some people out at first, and it did, but it works really well, and that's how we can afford to only ever append to files. So, applying that to the layout we described before: whenever we want to delete an object, we just punch a hole over its whole area.
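For the curious, here is a minimal Linux-only sketch of such a hole punch. Python's standard library does not expose these fallocate flags, so it goes through libc with ctypes; a 64-bit Linux, where off_t is 64-bit, is assumed:

```python
# Minimal Linux-only sketch of a hole punch with fallocate(2).
import ctypes
import ctypes.util
import os

FALLOC_FL_KEEP_SIZE = 0x01    # keep the apparent file size (ls -l) intact
FALLOC_FL_PUNCH_HOLE = 0x02   # deallocate the byte range

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = (ctypes.c_int, ctypes.c_int,
                           ctypes.c_longlong, ctypes.c_longlong)

def punch_hole(fd: int, offset: int, length: int) -> None:
    """Return `length` bytes at `offset` to the file system."""
    if libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# The example above: discard 100 KiB starting at offset 200 KiB.
# punch_hole(volume_fd, 200 * 1024, 100 * 1024)
```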
One constraint we have: for blocks to actually be returned to the file system, the punched range needs to be aligned on four-kilobyte boundaries. So, we align the beginning of each object on a 4K boundary.

Now, back to the Python code a little bit. Within the Swift code, we have not patched much of the existing code. Some prerequisite patches have been merged upstream by the Swift community, so this can now live mostly alongside the existing code. It's a new DiskFile, actually an alternate DiskFile implementation, for those of you familiar with Swift. And it relies on a vfile.py module that gives you a file-like abstraction to work with, so that you don't have to patch much code. It communicates with the index server, remember, running on the same machine, over a Unix domain socket.

A quick word about that vfile.py module. It provides a file-like interface, and if you open a file, you can notice that the path is a regular file system path, so you don't have to modify the Swift code. But the path needs to match the expected layout: if you try to open or create a file at an arbitrary path, it fails, because we wouldn't know how to store it and we wouldn't be able to reconstruct the directories. Keeping this interface means we don't have to patch much existing code, because, going back to here, even in this new code, we base our Python classes on the existing DiskFile code and override only what we need, as little as possible. Then, once you have your file, you can read and write as usual, and there is a directory listing call that works as you would expect from the os module.

Then, a word about fragmentation. XFS is an extent-based file system. We try to limit the extent count by allocating large blocks for these volumes. Hole punching is great, but if you punch a hole in the middle of a large extent, XFS has to create extents to represent that, so you get two extra extents. So far, this hasn't been a problem for us, but the way XFS works, when you open the file, or on the first read, I'm not sure, maybe someone will correct me, but early on, it needs to read all the extents and build its B-tree before you can access any part of the file. So, you want to be careful not to end up with millions of extents.

Something we can do is dedicate volumes to files we know will disappear shortly. Are you familiar with tombstone files in Swift? If a user deletes an object from a Swift cluster, we do not remove the data immediately. We create an empty file with a .ts extension, which indicates that, from the user's point of view, the object is gone; if they try to get it, they get a 404. But we know that the Swift cluster will remove all the files pretty quickly, including that empty file. So, we don't want too many of these in regular volumes. We can send them to dedicated volumes: when a delete comes in, we punch a hole over the object data like I just described, and on the tombstone side, we just record that, hey, this one is gone, and do nothing more. At some point, we stop writing to a tombstone volume, we create new ones, and once all the files in it have been deleted, we can remove the entire volume. That's something we could do; we have not had the need yet, but the code is there.
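To give a flavor of what such a module could look like, here is a hypothetical sketch of a vfile-style read path: an ordinary-looking object path is validated against the expected layout, then served from a volume-and-offset location. None of these names, including index.lookup, are the real vfile.py API:

```python
# Hypothetical sketch only: a file-like read view over an object stored
# inside a volume, addressed by a regular-looking Swift object path.
import re

PATH_RE = re.compile(
    r"^/srv/node/[^/]+/objects/\d+/"
    r"[0-9a-f]{3}/(?P<hash>[0-9a-f]{32})/(?P<fname>[^/]+)$")

class VFile:
    """Read-only, file-like view of one object inside a volume."""
    def __init__(self, volume_path: str, offset: int, length: int):
        self._f = open(volume_path, "rb")
        self._f.seek(offset)
        self._remaining = length

    def read(self, n: int = -1) -> bytes:
        n = self._remaining if n < 0 else min(n, self._remaining)
        data = self._f.read(n)
        self._remaining -= len(data)
        return data

def vopen(path: str, index) -> VFile:
    m = PATH_RE.match(path)        # arbitrary paths are rejected, as noted
    if not m:
        raise ValueError(f"not a valid object path: {path}")
    volume_path, offset, length = index.lookup(m["hash"] + m["fname"])
    return VFile(volume_path, offset, length)
```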
Write performance, or how we manage the safety of the data. Swift, in its regular implementation, fsyncs the file: you created a new file, you wrote data, and before you return to the user, you want to be sure it's persisted, so that if there's a problem, it does not get lost. We do something similar, but we can use fdatasync only, because the volume already exists; that's a little cheaper. Then, when we send the location information to the index server, that part is asynchronous, because making it synchronous would destroy performance. Which means that if we have a kernel crash or a power failure, the index server may be slightly behind what is really on disk. How do we handle that? When the system restarts, it notices that the shutdown was not clean, and we go through all the volumes, from the last known offset in the index server, and scan for new objects that we are missing. That works because the object header is written at the very end, just before we sync the data, and we add the missing entries back.

Performance. We use about 42 bytes per object in the index server, and no, we didn't do that on purpose, which is far less than the 300 bytes to 1 kilobyte of an in-memory inode. Latency may be slightly worse when you first put a server in production, but it gets much, much better pretty quickly if you have small files; with large files, it doesn't change your performance.

Replication. The replicator does not start by copying data. For those of you familiar with Swift, it walks the directory hierarchy on the object server, gets all the file names, and computes a hash of that, and this is what gets exchanged between object servers, so that one can notice it's missing data and needs to copy it. That used to be very costly: as Romain described, since the inodes were not fitting in the cache, we were doing so many IOPS that it was very slow. Now we can serve this from memory, so it's much faster. We also saved a little space; that's a side effect we weren't aiming for, probably because we are not creating directory inodes. And there's room for improvement, sure: in the key format, if you recall, the MD5 hash is stored as 16 bytes rather than 32, but we have made no effort to optimize the file name part yet; things like .data or .meta could be encoded.

A few benchmarks that I will let you read; I will not read them out. This is obviously for mostly small objects, I think 16K or 32K; again, for large objects, it doesn't change much.

So, what's next? This is available publicly on GitHub, but it's not in Swift upstream; that's something we may consider doing with the community. It needs review, more tests, and still some work, but it's already available if you want to take a look. Storing short-lived objects in dedicated volumes: we haven't needed it yet; part of the code is already there, but it's not activated. Replication of whole volumes: that may be interesting. Currently, we rely on the existing Swift replication mechanism, so it's per object. In many cases, you still need that, but sometimes, if the topology changed and you want to move an entire partition from one machine to another, it might be better to grab the whole volume and move it, instead of moving several thousand objects individually. And the last one is not strictly related, but erasure coding is not efficient for very small objects, because the smallest thing you can allocate on a modern drive is 4K, and if you have 15 fragments, that means using at least 15 times 4K, 60 kilobytes. So, that's something we would like to work on.
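The arithmetic behind that last point, as a small illustration using the 12 plus 3 policy mentioned earlier:

```python
# Illustrative: with a 12+3 policy and a 4 KiB minimum allocation per
# fragment, tiny objects balloon on disk.
BLOCK = 4096

def ec_disk_usage(size: int, data_frags: int = 12, parity_frags: int = 3) -> int:
    frag = -(-size // data_frags)                 # ceil: bytes per fragment
    on_disk = -(-frag // BLOCK) * BLOCK           # round up to one 4 KiB block
    return (data_frags + parity_frags) * on_disk

print(ec_disk_usage(100))        # 61440 -> 60 KiB on disk for a 100 B object
print(ec_disk_usage(1_000_000))  # 1290240 -> ~1.23 MiB, close to the 1.25x
```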
And finally, before we take questions, I would like to thank the OpenStack Swift community, who have been very helpful with this project and the other patches we have submitted, so thanks. And Facebook, for publishing a paper, it's quite an old paper now, about a project called Haystack, where they stored files like that, inside larger files; that's what initially gave us some of these ideas. So, thanks a lot, and if you have any questions, we will take them with Romain.

"Any performance tests on very, very small files, like one- or two-kilobyte files?" So, the question is: have we done performance tests with very, very small files, one to two kilobytes? Honestly, I don't think we have, because I've done most tests with what we see in production on some of our clusters, which is 16-kilobyte objects. So, 16 and 32, various sizes, but not that small.

"About short-lived objects: how do you know if an object will be short-lived or long-lived?" Thanks. So, the question is: how do we know if an object will be short-lived? In the case I mentioned, the tombstone file, we know because that's how Swift works, right? It creates the tombstone file to indicate that the user wants the object gone, and for eventual-consistency reasons, we need that as a file. And within a short time, which is operator-configurable, it will be deleted, so we know that. Another case that I didn't mention: Swift users can use the X-Delete-At header to say, okay, I'm putting this, but I want it gone in 10 days or 10 hours, and we might also want to send those to these non-punchable volumes.

"What happens if you keep creating volumes and end up with, like, a million files?" Yeah, great question. So, is there a limit on how many volumes we may create for a partition? We do have a configurable limit, in case, for some reason, so many requests come in for the same partition that we would create millions of volume files. Yes, we have a limit, writes will just fail if we hit it, and you can change it.

"How large are the volumes?" That's also configurable. We are not sure yet what the optimal size is; I think we are now running with 10 gigabytes. We've tried several values. Past a point, it doesn't make sense to go too big, because if you tend to run with full disks, and we do, since economically you have to fill them as much as you can, up to the performance-problem limit, you will get some fragmentation within the file, and it's less handy to work with the system if the files are too large. So, that's open for discussion, and it's configurable, but today we use 5 to 10 gigabytes. Also, something we didn't mention: at some point, we might want to compact a volume if there are too many holes inside, and if the volume is too big, it takes a long time to compact, and we may need to lock it during compaction. So, keeping volumes small makes compaction easier.

Any other questions? "Which kind of drives are you using?" It depends on our clusters, on the performance and the kind of storage policy. For our public cloud offer, which we call Object Storage, it's 2-terabyte disks, and for what we call Cloud Archive, it's 6- or 8-terabyte disks.
The reason is that across drive sizes, you always get roughly the same IOPS budget, so if you store more data on a drive, you have the same IOPS for more data, and statistically, more data means more accesses, so performance gets worse. No, we don't use SSDs, only spinning disks. Yes?

"How did you come to those numbers, the 12 plus 3 erasure coding fragments?" So, the question is: how did we choose three replicas and 12 plus 3 fragments for erasure coding? We decided that we wanted an overhead of 1.25, which is (12 + 3) / 12, because that's the offer we wanted to propose to our customers, so it's kind of a pricing reason, and we chose the numbers based on that. We also did not want too many fragments, because many fragments means, first of all, smaller fragments, and it would also increase the number of connections the proxy servers need to fetch the data. So it was a good compromise; we ran some benchmarks, and the numbers were acceptable for us.

So, the question is: any pointers to the code? There is something on Gerrit, but it's not the latest code; the latest code is on github.com. We are working on making it more available; yeah, it's going to be on the Swift Gerrit at some point. Okay, great, we will do this. Okay, thank you. Thank you.