Okay, so I think it's time to get started. It's good to be here, thanks for having us today. This presentation is about Ceph in OpenStack. I'm Sébastien Han, I work for Red Hat as a Senior Cloud Architect. It's a really fuzzy title that basically means I do design for cloud platforms. From time to time I do some blogging as well, so a little self-promotion there. And yeah, Josh. My name is Josh Durgin. I've been working on Ceph for about five years now. I'm the technical lead for the block device portion, so I'm more on the engineering side of things.

Okay, so today's agenda. For those of you who are not familiar with Ceph, we're going to go through Ceph really quickly, just to get an overview. Then we're going to cover what's new in Ceph and in the Liberty integration, and we're going to give a preview of what's going to happen in the Mitaka cycle.

So, as I said, for those of you who are not familiar with it, Ceph is a unified, distributed, replicated, open source, software-defined storage solution. Unified because it lets you consume your data in several ways: object, block, and file system, basically. Distributed because of the architecture: there's no single point of failure, just a bunch of machines. Replicated because we keep multiple copies of your data according to a replication factor. And it's open source.

A little bit of background on the design goals. Ceph is supposed to scale horizontally, so there should be no single point of failure where you have one big server and all of your clients accessing that one server; that's definitely not how it works with Ceph. The solution must be hardware agnostic and has to run on commodity hardware. Ceph self-manages whatever is possible, so if something breaks, it has to repair itself. It's open source. And we really want to move beyond old approaches like client-server, where all of your clients access the same server. Now it's more client-cluster: you have access to the entire cluster and you can write to every single machine that's available. And we don't do anything tricky with HA; it's built into the Ceph components themselves.

Now a more general overview of Ceph. As you can see, Ceph is built upon RADOS. This is the main layer of Ceph, where basically everything is managed; it's your object store, because in Ceph everything is an object. On top of that we have librados, which you can use to access and write your data. There are bindings for several languages, so you can simply build your own app, use the libraries to connect to your cluster, and write all of your data into it. But thankfully we have something already ready for you. For object storage, the first one is the RADOS Gateway. It's the equivalent of Amazon S3 and OpenStack Swift, basically: a RESTful gateway with an HTTP interface, and it supports multi-site setups with zones and replication across zones. The second component is RBD, which stands for RADOS Block Device. This piece is divided in two. There's the kernel driver, which is a little bit the equivalent of iSCSI: you create an image, map it to a host, and just write to it. And the second part is the virtualization plugin for QEMU and KVM, which is what we use with OpenStack. And the last part is CephFS.
That's the distributed file system. Everything is really stable and really robust, except CephFS for now. But as Sage mentioned yesterday, the first stable and production-ready version of CephFS should appear next year during Q1, so that's really good news.

A little bit about RADOS and the Ceph components. We have monitors: these are, let's say, the brain of your cluster. The monitor is the entity that manages the cluster maps, the topology of your cluster. It provides consensus through a quorum-based mechanism, and that's why you always need an odd number of monitors. It's not in the data path: when we want to access the cluster, we don't go through the monitors and then on to the cluster; we just ask the monitors for the map, so we have the topology of the cluster, and then we can talk to all of the cluster machines directly. Those machines host object storage daemons (OSDs). An OSD is basically one process bound to one disk. It writes data, replicates data, backfills, whatever is needed; it's the entity that manages your data.

A little bit about CRUSH, because CRUSH is definitely what makes Ceph so unique. It's a pseudo-random placement algorithm, which means that when a client wants to write data, it doesn't do a lookup in any hash table or anything like that. Every single time it wants to write data, it computes the location. So, based on a few inputs, we always know where the data will go, and if something breaks, where the data will move. This is really flexible, and it's topology aware, so you can design placement around racks, data centers, and so on.

So, Ceph in OpenStack. This is the current state of the integration, basically; we know how to work with every major component of OpenStack. For Keystone, you can use the RADOS Gateway with the Swift API: you simply register an endpoint in Keystone that points to your RADOS Gateway. We currently support Keystone v2, but since v2 is going to be deprecated pretty soon, we are working on supporting v3. For Cinder, we use RBD, so we can create block devices and attach them to virtual machines. For Glance, we also use RBD, so we store OpenStack images in Ceph. Same goes for Nova: when you boot an instance, you can have your root ephemeral disk living inside your Ceph cluster. So we finally have this unified layer where all of your OpenStack components are backed by the same storage entity, which is kind of nice, because there are several mechanisms we use to speed things up, and we're going to discuss that a little bit in a second.
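To make the librados layer mentioned above a bit more concrete, here is a minimal sketch using the Python bindings (python-rados). It assumes a reachable cluster with a ceph.conf and keyring on the machine; the pool and object names are made up.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()                 # talk to the monitors, fetch the cluster maps
    try:
        ioctx = cluster.open_ioctx('app-data')      # I/O context for one pool
        try:
            # The client computes placement with CRUSH and writes straight to the
            # OSDs; the monitors are not in the data path.
            ioctx.write_full('hello-object', b'hello from librados')
            print(ioctx.read('hello-object'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()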
And with that, I will hand it over to Josh.

Thanks, Sebastien. So I'm going to talk a little bit about what's been happening in terms of OpenStack integration with Ceph, and in Ceph itself. We had a number of bugs fixed in Liberty. I won't go through all of these in too much detail, but they include fixing the throttling code in Nova so that it applies to ephemeral disks like it does for Cinder volumes, and a few things to make Cinder itself more robust: retrying deletions when they fail because of a temporary condition, and making sure that long-running RBD calls like deletes no longer block the cinder-volume thread, so you can have many deletes running in background threads without affecting other operations. Also fixing up the clone depth options so that clones can be flattened automatically and you don't end up with a super long chain if you keep cloning volumes from other volumes. And fixing up config drives in Nova so that they can be stored in RBD directly instead of needing to live on a local disk; this is one of the steps towards diskless compute nodes.

There are also some other features and improvements in OpenStack in Liberty. One of the biggest ones was support for volume migration; I'll talk a little more about that in a moment. There are several smaller things. One is being able to report whether we support discard, or trim, requests, which is what file systems or block devices send down to RBD or SSDs to say they're no longer using some space and the underlying storage can get rid of it. This isn't fully supported in Nova yet, so it's not quite ready for use, but it's on its way. In Nova, Glance, and Cinder we added the ability to use the default feature set for an RBD image based on your ceph.conf settings instead of hard-coding it, so now that we've started adding more feature bits to RBD itself, you can pick up the latest ones without having to change OpenStack. Like I mentioned before, there are some more robustness fixes for Cinder: retrying if the connection times out, and trying all the locations for a Glance image in case one of them is cloneable but the others are not. And finally, there was support for multiple clusters in Cinder, passing the cluster name to some calls that didn't have it before; that's more of a bug fix, though.

So that's an overview of what's happening in Liberty, but let's talk a little more about volume migration. This is one of the larger features in Cinder. It lets you copy a volume from any kind of Cinder backend to any other kind of Cinder backend. That might be from one pool in Ceph, perhaps backed by hard disks, to a separate pool that is all SSDs, if you want to improve the performance of a volume, or it might be migrating from one storage backend to another entirely, for example from LVM into Ceph. Right now this mainly works offline, so you can't do it while a guest is actually using the volume. But in the future we're looking at making the online case work by doing a few tricks through QEMU: the guest itself is not aware, but QEMU, the hypervisor, is aware of what's going on, so it can work while the volume is in use.

So let's talk a bit more about what's new in Ceph. We're just about to release the new Infernalis release. The RC went out, I think, a week or two ago, and we're getting pretty close to the final release now. There are several new things in RBD itself. There is per-image metadata. This lets you associate arbitrary key/value pairs of strings with images, but it also lets you store persistent configuration options for RBD. So if you have images used for some kind of workload that benefits, for example, from a large cache size, you can store that cache size in the image metadata and it will always be applied whenever the image is used. You don't need to worry about configuring the cache size across all images; you can specify it for individual ones instead.
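As a rough illustration of how that per-image metadata might be driven from the python-rbd bindings: treat this as a sketch rather than the exact API of that release. The metadata_set() call and the "conf_"-prefixed key for a persistent per-image option are assumptions on my part, and the pool and image names are invented.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('volumes')

    image = rbd.Image(ioctx, 'volume-1234')
    try:
        # Arbitrary key/value metadata attached to the image itself.
        image.metadata_set('owner', 'team-db')
        # Assumed: a "conf_"-prefixed key stores a persistent per-image config
        # override, e.g. a bigger client-side cache for a cache-friendly workload.
        image.metadata_set('conf_rbd_cache_size', str(64 * 1024 * 1024))
    finally:
        image.close()

    ioctx.close()
    cluster.shutdown()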
One of the newer feature bits is called deep-flatten. In the past, when you had an RBD image and created a clone of it, you could flatten the clone, but if you had taken snapshots of that clone before flattening it, the snapshots would still reference the parent image. With deep-flatten, when you flatten an image that supports it, it actually flattens all of the snapshots as well. So if you create a clone of a parent image, snapshot it a bunch of times, and then flatten it, there are no more references to the parent image, and you can simply delete the parent if nothing else is using it. This makes managing clones and parents a bit simpler.

One of the newer features in Hammer was object map support for RBD. This keeps track of which individual objects in an RBD image, which is usually striped into four-megabyte chunks, actually exist. That lets us speed up all kinds of operations, like I/O to clones, since we can tell exactly which image we need to read from, the clone or its parent, or deleting images which have almost no data in them. In Infernalis it also lets us speed up differential snapshots: we can keep track of which individual objects changed between snapshots and generate the diff from this small bit of object map metadata rather than reading the entire image or the entire snapshot. Another new thing in Infernalis is enabling these newer features, like object map, deep-flatten, or exclusive lock from Hammer, on existing images. So you can take your pre-existing images, even ones you're already using with OpenStack, and add all these new features to them; not while they're actually in use, but at least without having to copy all the data.

The final thing the object map makes quite fast is telling exactly how much space is used by a given image or snapshot. Since we keep track of which objects exist in a given snapshot and which ones change between snapshots, the new 'rbd du' command can very quickly tell you how much space a given snapshot or image uses without querying anything else in the cluster, just this object map metadata. That might be used in the future for reporting back into OpenStack how much space is actually used versus how much is simply provisioned.

There's also been more groundwork for RBD mirroring, which is the big multi-site feature we're working on for RBD. We're not quite there yet, but more of the groundwork is going into Infernalis. So that's most of the RBD features in Infernalis.

There's a whole bunch of other things in Ceph; I'll mention a few of them here. The RADOS Gateway got support for the Swift API for object expiration. I didn't know it was such a popular feature, so that's great: objects can expire automatically from the gateway and will be garbage collected later, just like the normal garbage collection for other deleted objects. There's a whole bunch of performance work that went into Infernalis. If you saw the Ceph talks yesterday, you'll have seen that folks like Intel, Samsung, and SanDisk have been doing lots and lots of performance work, really improving the read path especially so far, and making significant strides in the write path more recently. There's also been some work on the cache tiering capabilities of Ceph to make them more efficient. In Infernalis, certain types of writes, like ones that create objects or append to objects, can be proxied to the base pool so that objects no longer need to be promoted into the cache pool just to write to them.
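Coming back to the object map and the new 'rbd du' command for a moment: if you wanted the same kind of per-image usage number programmatically, something along these lines should work with the python-rbd bindings. It's a sketch, the pool and image names are invented, and the "answered from metadata rather than a full scan" speed-up only applies when the image actually has the object map and fast-diff features enabled.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('volumes')

    image = rbd.Image(ioctx, 'volume-1234')
    used = [0]

    def count_extent(offset, length, exists):
        # Called once per extent that exists (or changed, when diffing two snapshots).
        if exists:
            used[0] += length

    # Diffing against "no snapshot" (None) enumerates every allocated extent.
    image.diff_iterate(0, image.size(), None, count_extent)
    print('provisioned: %d bytes, used: %d bytes' % (image.size(), used[0]))

    image.close()
    ioctx.close()
    cluster.shutdown()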
Also, since many, many distributions nowadays are switching to systemd, we finally support systemd in Ceph and the unit files are available in Infernalis. Trusty is still using upstart, so the upstart scripts are still in the Ceph tree as well.

One of the other newer things in Ceph in general is erasure coding, which has a plug-in architecture. Folks have been working on the SHEC erasure code plug-in, which uses slightly more space than standard erasure codes but gives much faster recovery in case of failures. Speaking of recovery, we have much better defaults in Infernalis for the recovery settings: things like how many backfills are going on at once on an OSD, or how much recovery impacts client I/O in general. The settings are now much more appropriate as defaults. In the past we found that many, many people were changing these settings, so we just adopted what they were doing and made it the default, which makes a lot more sense. There's still more work to be done on I/O prioritization. In Infernalis we have unified the queue for I/O coming from clients and for internal tasks like recovery and deleting old snapshots, and that lets us balance the storage performance used by each more finely. In the future this will probably become even more featureful, and we may even be able to offer some quality of service between different clients.

And finally, the last thing I want to mention that's new in Infernalis is the improved pool quota and full handling for the cluster. This is just general robustness. When the cluster becomes full, it stops accepting writes, and clients simply wait for the cluster to become un-full before sending writes again. In the past there were some issues with pool quotas, in particular with RBD, because the storage cluster would actually send errors back, and the block layer doesn't like errors very much; it ends up turning your file systems read-only. Now the clients simply wait and resend once the pool quota is increased or other data is deleted.

So let's talk a bit about the next release, Jewel. There are a whole bunch of changes happening in Jewel, several big-ticket features; I'll talk about a few now. One of the major ones for OpenStack use cases is RBD mirroring, which I mentioned a bit earlier. This is asynchronous replication from one site to another. You might have one site with, say, three copies, and another site with maybe only two, since you're not as worried about the data being lost there. The basic idea is that you use this for disaster recovery: you can have a constantly streaming, asynchronous replica of all your RBD images in a second site, flip over to using it, and all of your images will be in a consistent state. They might require some file system recovery, but they'll be able to do that; there won't be any corruption, and you don't have to deal with things like scripts for shipping snapshots around. It'll all be handled by a new RBD mirroring daemon.

Another aspect of the multi-site work we're doing is making the existing multi-site support in the RADOS Gateway, which currently uses a separate Python agent script to do the synchronization, easier to use: first by redoing how we configure it so the configuration is much simpler, and also by putting the actual replication into the gateway itself.
And also enabling it to work in an active-active fashion, so that you can be writing to multiple clusters through the RADOS Gateway and access your objects from anywhere.

The big things in CephFS, which Sage was talking about a bit yesterday, are the file system check and repair tools. These are the last remaining pieces in CephFS before we're ready to declare it production ready, so that's what folks have been working on very furiously.

Kind of an interesting thing that's in the research phase now is doing more quality of service, with different clients having different guarantees for IOPS. For example, one client might be guaranteed 100 IOPS while another client might just be given best effort, and doing this in a way that is scalable and works well in a distributed system like Ceph. So watch that space; there might be more interesting things coming out of it. Probably not in Jewel, but perhaps a prototype or an experimental version, and maybe in the L or M release we'll see it become more of a possibility.

For a while now, if you've been using Ceph, you may have noticed that it uses many, many threads, and that this number of threads grows with the number of OSDs you have and the number of connections you make to the OSDs. This is because of the way the Ceph networking layer, the messenger, is written. One of the things that improves that quite a bit is an asynchronous messenger that uses a constant-size, or dynamically growing and shrinking, thread pool rather than assigning particular threads to particular connections. That's been around for a little while now, but we're stabilizing it further and making sure it's more battle-hardened, and hopefully it'll be stable enough to use in Jewel. It also tends to improve performance a bit in certain cases, so hopefully it'll end up being strictly better than the simple messenger, the existing one.

And of course there are plenty of performance improvements going on. Many folks right now are still focusing on write performance, and Sage is redesigning the way the OSD stores data on disk: rather than going through a file system and a separate journal, for many operations we can avoid the journal and not pay that double-write penalty. There are still discussions going on and the whole design is a little bit in flux, but it may even bypass the file system altogether, so we can get the full performance of the disk.
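Going back to the messenger point for a second: this is just a toy sketch in plain Python, not anything from the Ceph messenger code, but it shows the property the async messenger is after, namely that the number of worker threads stays fixed no matter how many connections feed them.

    import queue
    import threading

    WORKERS = 4                          # stays constant however many peers connect
    inbox = queue.Queue()

    def worker():
        while True:
            conn_id, msg = inbox.get()
            if msg is None:              # shutdown marker
                break
            # ... dispatch the message for this connection here ...

    threads = [threading.Thread(target=worker) for _ in range(WORKERS)]
    for t in threads:
        t.start()

    # Simulate traffic arriving on 100 different connections; a thread-per-connection
    # messenger would need 100+ threads for the same load.
    for conn_id in range(100):
        inbox.put((conn_id, b'ping'))

    for _ in threads:
        inbox.put((None, None))          # one shutdown marker per worker
    for t in threads:
        t.join()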
Okay, thanks, Josh. So now we're going to go through the last part of the talk and give you a little Mitaka preview. If you follow me a little bit, you will know that I'm really excited about this feature: direct RBD snapshots for root devices in Nova. The basic issue is that when you take a snapshot of your instance today, Nova uses QEMU and scans the entire device; the snapshot sits locally on the hypervisor and then gets streamed into Glance. So it's a really long operation and it's not very efficient either. The main problem is that when you run, say, a public cloud, where you don't really control what is happening in your cloud, you might have several users doing snapshots, and if you want diskless compute nodes, you can't have them, because you have to reserve a certain amount of space on your compute nodes just to hold those snapshots before they get uploaded into Ceph.

With this feature we don't need that anymore, and I will go into some of the details on the next slide. This is really something we have been struggling to get into Nova, but now we have the spec and we have the code as well, so I really hope this will land in Mitaka, because, as I said, it will allow us to have diskless compute nodes, or at least a really tiny root disk for your compute nodes. If something is not configured properly, though, you might still have to specify the snapshot directory path just in case, but if it's configured well you shouldn't have to.

Hopefully you can see everything on the slide, but this is basically what happens under the hood when you do a snapshot. Instead of using QEMU and scanning the entire device, we take snapshots at the RBD level; this also assumes that your instance is already living in Ceph, of course. So, first step, the user runs 'nova image-create' on their VM, and then we begin the snapshotting process. We create an RBD snapshot of that instance, and we protect the snapshot because we have to clone it later. Then we clone the snapshot, but we clone it into Glance: the way it's configured, your instance lives in a specific pool, the Nova pool, and all of your images are stored in another pool, so we clone the snapshot into the Glance pool and flatten the resulting image, which breaks the chain dependency with the parent. When that's done, we unprotect the snapshot of the instance and remove it, because we just don't need it anymore. And on the Glance side, the usual thing: we create a snapshot of that image and protect it, because later, if you want to boot another instance from it, we will have to clone it again.

That relies on another feature we already use today: copy-on-write cloning. Basically, when you boot an instance, if the image is stored in Glance and the backend configured in Nova is Ceph, we just clone the Glance image and that becomes your Nova instance disk, so it's really fast. With this scenario you don't really need any space available on your compute node, so it's really efficient and really fast.
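To make the flow above concrete, here is roughly what those steps look like expressed with the python-rbd bindings. This is not the actual Nova code, just a sketch; the pool names ('vms' and 'images'), the image names, and the snapshot names are made up.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    nova_pool = cluster.open_ioctx('vms')        # where the instance root disk lives
    glance_pool = cluster.open_ioctx('images')   # where Glance keeps its images

    instance = rbd.Image(nova_pool, 'instance-0001_disk')
    instance.create_snap('to-glance')            # 1. snapshot the instance disk
    instance.protect_snap('to-glance')           # 2. protect it so it can be cloned

    rbd.RBD().clone(nova_pool, 'instance-0001_disk', 'to-glance',   # 3. clone into
                    glance_pool, 'snapshot-of-instance-0001',       #    the Glance pool
                    features=rbd.RBD_FEATURE_LAYERING)

    new_image = rbd.Image(glance_pool, 'snapshot-of-instance-0001')
    new_image.flatten()                          # 4. break the dependency on the parent

    instance.unprotect_snap('to-glance')         # 5. the temporary snapshot can go now
    instance.remove_snap('to-glance')

    new_image.create_snap('snap')                # 6. Glance-side snapshot, protected so
    new_image.protect_snap('snap')               #    future boots can clone from it

    new_image.close()
    instance.close()
    glance_pool.close()
    nova_pool.close()
    cluster.shutdown()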
Okay, so now some future OpenStack improvements that might happen in Mitaka, though I'm not quite sure at the moment. We would like the ability to attach the same volume to several instances. Say you have an application that never writes data but only reads it: you keep one volume with all of your data and attach that volume to all of your instances, though of course you want the device attached read-only. We want to optimize volume migration, because at the moment, if you have two volume types configured that are both Ceph but point to different pools, we don't do a direct copy within Ceph; it goes through a file handle and is done in Python, and while it works, it's not as efficient as it could be if we used the libraries directly, so that's something we definitely want to improve. Also, when we create an image from a volume, or vice versa, we currently need a bunch of space on the node, and we don't really want that, so that's something we have to work on as well. Thin provisioning reporting: at the moment, if you create a Cinder volume of one gig, we report one gig used, but what we want is to report the real space used by the block device instead of the entire provisioned size. We want to support force detach, so that if something crashes you can detach the volume and reattach it to another instance. Online volume migration from Ceph to Ceph, which is what Josh mentioned earlier. And we want to enable volume encryption with QEMU.

One final word, and we can't say this enough: if you want to configure Ceph with OpenStack, this is the documentation you have to use. It's kept up to date for every new OpenStack version, so definitely don't hesitate to have a look at it; that's the proper way to configure your environment. So that's all we have. Thanks for your kind attention, and I think we have, yes, twelve minutes for questions. Any questions? I think we have the, yeah, we have the mic. I don't know how to turn it on.

Okay, I can repeat the question. I think this will likely happen next year, but I don't have much detail; I think it's being worked on by someone from the community. Maybe Neil knows a little bit more. Oh, the question was: do we have a clear status on the integration of Keystone v3 with the RADOS Gateway? I know it's being worked on. Okay, ready for Jewel. So not the next version, but the one after. Okay, sorry.

Okay, how much time does it take to flatten? Well, I think it depends on the size of the image. Yeah, flattening involves actually copying all the data from the parent to make the image fully independent, so it depends on the size of the image and how much data has changed since the clone was made. Yes, yes. Well, yes, but in the meantime we don't really have a choice: we can't afford to keep that image as a clone of the snapshot, because that would mean we could never delete the instance, so we really have to flatten it. As for the time it takes, I don't really know, it just depends on the size, but it happens in the background, so it's not a blocking operation anyway. Yeah, it happens in the background so it won't block anything, but it might take a little while for the snapshot to become available to use. The snapshot is immediate, right? Well, from a user's perspective the snapshot won't actually become available to use until the flatten is finished, so it takes however long a flatten takes. Yeah. Okay. Try the mic, please.

I believe a lot of us have legacy hardware from before the cloud days. Suppose we have to migrate those legacy drives, say many 15K SAS drives which are currently directly attached to the system through RAID controllers, and bring all of those under the umbrella of Ceph: how much overhead does the networking or the RBD protocol introduce? The question is whether the performance level, or the IOPS, can get on par with direct-attached storage, or what extra hardware can be added, in the form of cache tiering or journaling, so that this old hardware can be used in the cloud ecosystem.
Yes, you can use whatever hardware you want, and if the question is how to speed that up a little, you might want to consider using SSDs for your Ceph journals. When you use a journal, every single write goes through the journal and is then flushed to the file system, so for your OSDs you configure the journal to live on an SSD while the actual data stays on the SAS disks. There was a particular test case, published on the Ceph website I believe, about migrating PostgreSQL onto a Ceph cluster; relational databases obviously need a lot of IOPS and bandwidth, and the final comment there was that it was not up to par, though that was probably a year or two back. Yeah, a year or two back Ceph's IOPS performance in general was much worse than it is today, so I think you'd want to try again now and you'll see better performance, and depending on how many IOPS you need, you might actually want to go all-SSD rather than using separate SSDs plus the legacy disks. Another thing you can try, of course, is some kind of caching on the OSD node itself, where an SSD is used as a cache with bcache or dm-cache, one of those, which helps quite a bit for read-heavy workloads. So we could attempt mixing SSDs with the legacy SAS spinners? Yeah, yeah. It's not ideal, but you can do it. Sure.

This is regarding RBD mirroring: does it use snapshots in the background? No, the way RBD mirroring works is that it writes a journal of all the writes that come in, and all the metadata changes, for a given image to a journal of RADOS objects. That way we always have a consistent point in time: as we stream it out, since we keep the writes in order, the remote copy is always in a state the guest was actually in at some point, never an inconsistent one. Whereas with a snapshot-shipping approach you always have windows between snapshots where, if you tried copying, you could end up with a state that is not easily recoverable from the guest's perspective. So you are mirroring the journal and then replaying it on the remote side? Yeah, we just replay the journal on the remote side. How does it handle multiple volumes, say a VM with two volumes? Each volume has its own journal. And how do you keep the ordering between the writes across them? Yeah, if you have multiple volumes that are used together and you want to make sure they stay consistent with each other, we don't take care of that yet. That's a feature called consistency groups, essentially, which is another part we'll add once we're finished with the basic mirroring. Nice.

Hey, do you know if in Mitaka or another release there will be the ability to configure the number of I/O threads in Nova? I think it defaults to one. This is something we have also been struggling with for about a year or two. It comes up when you use the virtio-scsi controller with QEMU and want more than one thread doing I/O. I don't really have any visibility on that, but I would really like to have it. I think there were two patches; one got abandoned and I'm not really sure about the state of the last one, but the way they want to do it is way too complex, I believe, though that's the way they do things in Nova. So, yeah. Not sure if that's...
Because I think they even did a spec for that, so we'll see how it goes. All right. Thanks. Okay.

First, thank you for changing the defaults for the recovery and rebalancing traffic; it was always driving me a little bit nuts. When I first set up my cluster, one problem I experienced was that the recovery defaults took out my client traffic. But speaking of defaults, I've heard from several of my colleagues, and seen myself, that turning off all the debug logging can give substantial performance improvements. Have you considered turning those debug logs off by default? Yeah, we've thought about that a bit. We might consider doing it for Jewel, but we also want to look at making the debug logging more efficient in general so it doesn't have that much overhead, because those logs are really useful if Ceph suddenly crashes, at least the in-memory logs, so you can see what happened. Right. All right. Thank you. Sure.

Is there any expected date or something like that for persistent RBD caching? Yeah, there's some interesting work on persistent RBD caching; some folks from Antelope are interested in working on it. First, I'd say, probably a single write-back cache for a single guest, and potentially in the future a shared cache among several guests. Okay.

And something about the usage of the journal. Sorry, can you repeat that? Yeah: how full is the journal? In the past we found that when the journal got full we saw drops on the network, so is there something to monitor whether the journal is full, or something like that? So the question is about the journal becoming full and seeing some kind of traffic effect because of the work left to do. We saw spikes in the network, and actually it was because the journal size was too small. So if you're seeing those kinds of spikes, generally you want to tune the way the filestore does syncs; it's either syncing too often or not batching things up enough. In general, though, it's harder to avoid those kinds of spikes on hard disks than it is with SSDs, even with tuning. The problem is that if the journal gets full you drop back to SAS or SATA speed, so it would be nice to at least have an idea of how you are doing and what journal size would fit your workload. As far as I know there is no way to see the exact usage of the journal; something like that would be interesting for sizing and tuning. I think you don't really need that if, from the start, you're a little generous with the size of the journal. This problem can be solved fairly easily, because there are only a couple of things to take into account, basically the network speed and the speed of the drive; if you know those, you can size your journal properly. I agree it could be interesting, but if you configure a proper journal size and give it a little headroom, you shouldn't have that problem anymore. I don't think we have much time, do we? No? Thank you very much.
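As a footnote to that last exchange, the sizing guideline usually quoted for the filestore journal is to allow roughly twice the data the journal can accumulate between syncs, i.e. expected throughput times the filestore max sync interval, times two. A quick back-of-the-envelope sketch in Python, with made-up numbers:

    # Hypothetical inputs for illustration only.
    throughput_mb_s = 110               # roughly what a 1GbE link or one spinning disk sustains
    filestore_max_sync_interval_s = 5   # how long writes can accumulate before a sync

    # Commonly cited guideline: journal size >= 2 * throughput * sync interval.
    journal_size_mb = 2 * throughput_mb_s * filestore_max_sync_interval_s
    print('suggested journal size: %d MB' % journal_size_mb)   # 1100 MB for these inputs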