All right, welcome to the final session of the Ceph track here at OpenStack Summit. Go ahead and take your seats so we can get started. Just a reminder: if you do have questions, either in the middle or at the end, make sure you use the microphones, so that the recording we have going for posterity will pick you up as well as the rest of us in the room. Our next and final speaker is the creator and current project lead for Ceph, Sage Weil. He's going to talk about some of the more recent work that's been going into the upcoming release, Luminous, and a few other things we have going on. So, Sage?

Thanks. Can everyone hear me okay? Yes? All right. My name is Sage Weil. I'm the Ceph project lead; I work at Red Hat in the office of the CTO, and I oversee Ceph development and so on. I'm going to talk a bit about the release that's about to come out, Luminous: what's in it, what's cool. Then I'm going to talk about what we're working on after that, and then a bit about contributors, stats, and so on.

So, just to level-set: Ceph is on a regular release cadence. Normally we do releases every six months, and every other one is an LTS, which means we do backports for bug fixes and so on. Luminous is about to come out. It was supposed to be spring; it's going to be another month or two before it's out, so it'll be early summer, I guess. We're a little bit behind. The next release after that is going to be Mimic, which will be a non-LTS, in the fall, or maybe winter if we continue our laggard pace, and the one after that will be the N release, probably Nautilus, but the name isn't really finalized yet. What's the deal with the names? They're all names of cephalopods, and they go in increasing letter order. Luminous is the luminous squid, which is beautiful; go look at it on Google Images.

So, lots of good stuff coming in Luminous. It's going to be a really good release, and I'm very excited about it. The biggest piece, the one I'm most excited about because I worked on it primarily, is BlueStore. BlueStore is going to be stable in Luminous, and it's going to be the default backend for the OSDs, so that's a big milestone. BlueStore consumes a raw block device, in contrast to our legacy FileStore, which consumes XFS. We use RocksDB internally for metadata, but it's all packaged up as one big thing that we control. It's very fast on hard disks, roughly twice as fast, both for large IOs and small IOs. For regular SSDs it's also faster than FileStore, maybe more like one and a half times, but it varies with your workload. For NVMe it isn't that different from FileStore, because the NVMe isn't actually the slow part; there we have other issues to deal with, optimizing Ceph itself so it uses less CPU. But BlueStore is the future, and that's where we're trying to get to. More importantly, it gets rid of all this legacy stuff we had with FileStore, so all those weird performance anomalies you wouldn't notice until you had strange workloads go away, because we are less stupid, mostly. BlueStore has full data checksums on everything, so every time you read any data from disk it gets checksum-verified; you won't get silently bad data back. It also does inline compression with zlib or Snappy, which is nice. And it's going to be a stable thing. Lots of people contributed to it, it was a group effort, and we're very excited to finally have it done and out there.
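As a rough illustration of the inline compression piece: in Luminous, compression is configured per pool. A minimal sketch, with a made-up pool name; check the release docs for the exact option values:

    # Hypothetical pool name; compression settings apply per pool.
    ceph osd pool set mypool compression_algorithm snappy   # or zlib, zstd, lz4
    ceph osd pool set mypool compression_mode aggressive    # compress all writes;
                                                            # "passive" compresses only hinted writes

Nothing changes from the client's perspective; the OSD compresses chunks on write and verifies checksums on read either way.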
I'm going to show a few quick performance plots. These show large and small random write IOs, comparing throughput and latency across a whole bunch of development branches; the top one is the one that all got merged. They're a little bit old, but the takeaway is that BlueStore is roughly twice as fast for both large and small IOs. It's a similar picture when you mix reads and writes; the reads aren't necessarily twice as fast, so it's a bit of a blend there. More importantly, if you look at an aggregate workload, not just a micro-benchmark, things like the RADOS Gateway doing updates on the bucket indices stored in RADOS: there were all these weird, annoying things FileStore had to do to make that work properly and be consistent and safe. BlueStore does it much better, so the improvement there is more like 3x or 4x, depending on your workload, because it just doesn't do as much work as it used to. So we're very excited about that.

As a consequence of all that, it also enabled another big feature: erasure code support for the RADOS Block Device, finally. The key missing piece before was that erasure-coded pools didn't support overwrites of existing data; now they do, so you can put a block device on top of those objects. It requires BlueStore in order to perform, because we have to do a two-phase commit to be able to roll back; it's implemented on FileStore too, but it's horrifically slow there. The other thing is that we rely on the checksums in BlueStore to do the deep scrubbing, so if you're using FileStore with EC overwrites you can't really deep scrub: it doesn't actually verify anything, it just goes and reads the data. So BlueStore and EC overwrites go together in Luminous; it's there, it's good. It's a significant efficiency improvement over 3x replication: even with a fairly narrow code like 2+2 or 4+2, it's a factor of two, a 50% cost improvement, much less storage you have to buy. It's not perfect, though. With erasure coding, small writes are slower, because you have to update more devices when you update a full stripe. We hope to mitigate that with BlueStore, although we haven't done the final testing that pits FileStore with 3x replication against BlueStore with erasure coding; that's in progress, so we'll see how it nets out in the final release. On the flip side, large writes are actually faster than replication, because you're doing less total IO to your devices, which is also good. The implementation is still doing the simple thing: a small write updates a full stripe. We want to do things that are more clever, but it's going to take a little more time before we can make those optimizations. But it works in Luminous; it's there, it's ready to go.
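Here's a quick sketch of the RBD-on-erasure-code workflow in Luminous. Pool and image names are made up; note that the image metadata still lives in a replicated pool, and only the data objects go to the EC pool:

    # Create an erasure-coded pool (default EC profile unless you name one)
    # and allow overwrites; this needs BlueStore OSDs to perform well.
    ceph osd pool create ec-data 64 64 erasure
    ceph osd pool set ec-data allow_ec_overwrites true

    # Create an RBD image whose data objects live in the EC pool; the
    # image header and metadata stay in the replicated 'rbd' pool.
    rbd create --size 10G --data-pool ec-data rbd/myimage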
The other big piece of work that went into Luminous is the new Ceph Manager daemon, ceph-mgr. It actually appeared in Kraken but didn't do anything very useful yet; in Luminous it does lots of useful things. The main thing is that it offloads a whole bunch of work the monitor used to do into a new daemon. The monitor previously dealt with all the PG stats, so it ended up with a whole bunch of data just churning through Paxos, slowing the monitor down and limiting our overall scalability. That's all gone: the manager does it instead, and since the manager doesn't have any durable state, it's much faster and more efficient. And the monitor can focus just on the things that are important for keeping your cluster up and consistent and storing data. So it's going to make ceph-mon scale again. Coincidentally, this morning I got word that CERN has something like a 10,000-OSD cluster that we're going to be able to use for two weeks, a couple of weeks from now, to do another Ceph scale test. It's perfectly timed to test all this new Luminous stuff and actually see how it does at that size, because we don't usually get to buy that much hardware at Red Hat, unfortunately. So that's going to happen in the next couple of weeks; very excited about that.

ceph-mgr also has a new REST API. We took the Calamari API and adapted it; it uses the Pecan framework now, and it's written in Python. The manager has this nice Python plugin framework you can use. There's also going to be a built-in dashboard. It's super simple, basically "ceph -s" on the web, but it works. Here's a screenshot. It's pretty simplistic right now; this is a one-OSD cluster on my laptop, so it doesn't show anything very interesting. But it shows all your daemons, all your health warnings, whether health is good or bad; it shows the log; just really basic stuff like that. So not much in Luminous yet, but this is going to be a building block moving forward. Eventually we'll present all the metrics the manager already collects through the GUI, and that sort of thing. (There's a tiny sketch of turning these pieces on below.)

There's also a new network messenger implementation. It was actually new in Kraken, so new since Jewel. The new implementation doesn't use up lots of threads; it's event-driven, much more efficient, a fresh code base. It's great, it's so much better. It also has pluggable backends: there's an RDMA backend for the messenger that is built by default. It isn't tested very heavily, because I don't have any RDMA gear in our community lab, but it's being used in a few places in production with good results. So it seems stable, but it's not officially supported and tested yet; your mileage may vary. There's also an experimental DPDK backend that uses Intel's userspace acceleration libraries. That also looks very promising; it's definitely at the prototype stage, it's not ready to go, but the code is there and you can build it and play with it if you want. Very excited there. Mellanox and XSKY have been the main people working on the RDMA stuff.
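For reference, turning the new manager pieces on is just a module toggle. A minimal sketch, assuming a Luminous-era cluster; run "ceph mgr module ls" to see what's actually available in your build:

    # Enable the built-in dashboard and the REST API modules.
    ceph mgr module enable dashboard
    ceph mgr module enable restful
    ceph restful create-self-signed-cert   # the restful module wants TLS

    # See where the manager is serving them.
    ceph mgr services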
And the other nice thing coming in Luminous is that we're finally going to have well-balanced OSDs. Everybody who operates a large cluster deals with the variation between the least-utilized OSD and the most-utilized OSD, with reweights and capacity planning; it's just a headache. We finally have a bunch of new tools to make that an essentially perfect balance. The first is something called choose_args for CRUSH, which is basically a way to feed in an alternate set of internal weights for a particular pool, tweaking what CRUSH does so you get exactly the distribution you want. It's a generic capability, but what it allows us to do is run a numeric optimization, just a gradient descent that fiddles with all the internal weights, so the actual output matches the weights you intended when you put them in.

That solves the imbalance problem, but it also addresses something we've been calling the multi-pick anomaly, although that's a very imprecise mathematical term. It's basically an issue with the underlying probability math CRUSH is based on: if you have a device with a very low weight among a bunch of devices with larger weights (for example, a bunch of full racks and a new rack with one server in it), CRUSH tends to put too much data on the low-weight devices and overflow them. It's annoying math, and we didn't even notice it for a long time, but we can correct for it as well, by using adjusted probabilities for the second and third replica choices in CRUSH. The good news is that the imbalance part of the optimization is backwards-compatible with older clients. If you want to correct for the multi-pick part, you have to wait until all your clients are running Luminous and understand the new stuff.

The other tool is something called pg-upmap, which is simply the ability to put an explicit exception mapping in the OSDMap that says: this PG is stored on these OSDs, period. It just overrides whatever CRUSH says. And there's a really simple optimizer that looks at your distribution and says, this OSD has one PG too many and that one has one too few, I'll move it over, and does that. Both tools are there, but pg-upmap also requires Luminous clients, so you won't really be able to use it in a production cluster until everybody has upgraded on the client side.

A few other odds and ends on the RADOS side. CRUSH has something new called device classes, where you can just tag the OSDs in your system as being a particular class or type (these are SSDs, these are hard disks, these are NVMes) and then write a really simple CRUSH rule that says map to OSDs that are SSDs, or map to hard disks. Previously, if you wanted to do this, you had to manually edit your CRUSH map, create two parallel hierarchies, and fuss with all the names, and all the automatic CRUSH manipulation stuff kind of broke; it was super tedious. Now it works out of the box; it's really simple, so that's nice. There's a streamlined disk replacement process that's well documented, so you can replace the disk behind an OSD and keep the same ID, and it's going to be simple and actually work, which will be nice. There's a new config option that lets you, as an administrator, just declare the oldest client version you want to support, and the monitor will gate all the operations that manipulate CRUSH tunables and so forth so you don't screw up. You can say, I want to stay compatible with Hammer clients, tell the cluster that, and it'll prevent you from doing anything that would break that constraint; just making operators' lives a little bit easier. And we're annotating and documenting all the config options in the code, so you can just do a dump and see all the options, what they mean, and whether you should touch them: they'll be marked as, like, experimental, developer-only, do not touch, versus something you can adjust, expert-only, that sort of thing. (A quick sketch of several of these knobs follows.)
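Here's a rough sketch of the device-class, client-gating, and pg-upmap pieces together. The OSD, rule, and pool names are invented, and the PG and OSD IDs are placeholders:

    # Tag an OSD with a class (Luminous usually detects this automatically).
    ceph osd crush set-device-class ssd osd.3

    # A simple rule that maps only to SSD-class OSDs, and a pool that uses it.
    ceph osd crush rule create-replicated fast-rule default host ssd
    ceph osd pool set mypool crush_rule fast-rule

    # Declare the oldest client release you need to support; the monitor
    # will refuse changes that would break it. pg-upmap needs this to be
    # at least luminous.
    ceph osd set-require-min-compat-client luminous

    # Explicitly remap one PG, moving its copy from osd.3 to osd.5.
    ceph osd pg-upmap-items 1.7 3 5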
There's also a mechanism now so that if a PG or object is stuck, the OSD tells clients to back off and stop sending requests; in certain recovery situations the old behavior could bite people, with requests piling up against something stuck so that clients couldn't get anything useful done. So there are things like that fixed, and better EIO handling. There are peering and recovery speedups, some new in Kraken, actually, and some new since Jewel: in most cases now, if an OSD fails, it's noticed immediately; you don't have to wait for a heartbeat timeout, so failure detection is much faster and the cluster moves on. Lots of good stuff; there's an ongoing list of random little robustness improvements.

Now I'm moving out of RADOS and into the RADOS Gateway. We have this high-level view that, in the future, most data is going to be stored in object stores. Block is obviously very important, particularly for cloud workloads and hosting VMs, but that's not actually where most of the data is going to be; most of it is going to end up behind object APIs like S3 and Swift. So there's a whole raft of features we're looking at here: erasure coding, tiering, multi-site federation, and so on. New in Luminous, the biggest and most exciting new thing is RADOS Gateway metadata search. We already have this mechanism (my slide build is kind of screwed up here) where you can take Ceph clusters in multiple data centers, or in the same data center, and have multiple zones: essentially independent RGW installations that are federated with each other. They share a bucket and user namespace; you can put a bucket in a particular zone, replicate across zones, do active-active, all kinds of stuff. The mechanism that actually does the syncing is pluggable, and so there's a plugin that syncs just the metadata and dumps it all into Elasticsearch, plus a new set of search APIs in the RADOS Gateway that query Elasticsearch. So if you set up one of these zones to index your object gateway content, either the default attributes or whatever headers you care about, you can do search queries to find out what you're storing: what file types, what headers they're setting, whatever you want. So that's exciting and totally new.
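A very rough sketch of standing up a metadata-search zone, following the sync-module pattern; every name and endpoint here is made up, so treat it as the shape of the thing rather than gospel:

    # Add an elasticsearch-tier zone to an existing zonegroup.
    radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-es \
        --endpoints=http://rgw-es.example.com:80 --tier-type=elasticsearch

    # Point the tier at the Elasticsearch server and commit the new period.
    radosgw-admin zone modify --rgw-zone=us-es \
        --tier-config=endpoint=http://es.example.com:9200,num_shards=10
    radosgw-admin period update --commit

The zone then receives metadata via the normal multisite sync machinery, and the new search APIs query Elasticsearch on your behalf.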
There's a bunch of other stuff in the gateway. There's a new NFS gateway; it was actually present in some versions of Jewel, or got backported to some of the downstream stuff, I can't remember if it was upstream before. It's a very simple NFS gateway that lets you mount the RADOS Gateway over NFS v4 or v3 to copy data in or out, which is great for migrating existing workloads from file-based storage systems to object as you make that transition. It's not meant to be a full POSIX file system; it doesn't do small writes and renames and truncates and all that random stuff, but for just copying data in and out it works great, and that's big for a lot of users.

The biggest management and operations headache we're resolving is dynamic bucket index sharding. For RGW users, if you put too many objects in a bucket, the bucket index would get big. There was a tool you could run offline to reshard it, or you could decide the shard count up front when you created the bucket, but it was kind of a headache and you had to plan ahead; it wasn't very friendly. Finally, in Luminous, it's just automatic: as a bucket gets big, its index is resharded on its own, online, and you don't have to do anything. There's a bit of a theme here of chipping away at annoying things you shouldn't have to worry about. So that's good. There are also a couple of other headline features that came into the gateway; a team at Mirantis did a bunch of great work here. There's inline compression, so RGW will compress data as it comes into the cluster and write the compressed data to RADOS; that happens transparently. And there's a set of encryption APIs, following the S3 server-side encryption APIs (I can't remember the name of the whole category), where you can set keys on buckets and on users; it's a big, complicated API that Amazon made up, and we basically implemented it, so it's there. (I'll show a tiny sketch of the compression and resharding knobs in a moment.) And then there's just a whole bunch of S3 and Swift API surface that's been improved, added, and updated; there's a constant flow of issues there that get resolved. Those are the big exciting things on the RADOS Gateway side.

On the RADOS Block Device side there's also a lot going on. The biggest thing is obviously erasure coding, which I already mentioned, but I'll mention it again because it's a big deal: you can run RBD on an erasure-coded pool and buy fewer hard disks and SSDs and everything else. It's pretty simple: you specify the data pool when you create the image, and it puts just the data objects in that pool. A lot of work also went into RBD mirroring: there are now multiple rbd-mirror daemons, they share the load, they're HA, all that stuff, whereas in Jewel it existed but was just one daemon. Now they scale out; lots of work there, mostly around robustness rather than new feature capabilities. Improved Cinder integration, as always; ongoing OpenStack work. A lot of work is going into iSCSI. This has been a multi-year journey of false starts and attempts to use different kernel interfaces that got tanked by the upstream kernel, whatever; the latest iSCSI approach is based on LIO's TCMU-runner, which is basically userspace pass-through. The iSCSI kernel target passes through to librbd in userspace, which is nice because you get the full librbd feature set, the latest and greatest, and the performance penalty of the pass-through is actually very modest. It's going to be a full HA solution that does failover and SCSI persistent reservations and all that, so that's coming. And on the kernel side there have been lots of RBD improvements, keeping up with the CRUSH and OSD and cluster protocol changes. Exclusive locking is in the upstream kernel now, and so is support for the object map; both are kind of old features on the librbd side, but they're now in the kernel, so if you're using the native kernel block device you get that stuff too.
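The promised sketch of those gateway knobs. The zone and placement names below are the usual defaults but may differ in your setup, and the bucket name is made up:

    # Turn on inline compression for a placement target.
    radosgw-admin zone placement modify --rgw-zone=default \
        --placement-id=default-placement --compression=zlib

    # Dynamic resharding runs on its own in Luminous; you can watch it.
    radosgw-admin reshard list
    radosgw-admin reshard status --bucket=mybucket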
And finally, CephFS. If you've seen my talks the last few years, or last year I guess, you've seen this before: we used to go through Ceph saying all these parts of Ceph were awesome, but CephFS was nearly awesome, because it wasn't ready yet, yada yada. Finally, now, CephFS is production-ready. It's stable. It is now fully awesome. Yay. I said exactly the same thing at the OpenStack talk six months ago, except the slide said 2016 instead of 2017. The latest nearly-awesome part that is now fully awesome is that Luminous will have support for multiple active MDSs, active-active, which has also been a long time coming and is finally there. (I think we lost the slides for a second there; hopefully not long ago. All right, let me see if I can catch up. I blame the door. There we go.)

Okay, so: multiple active MDSs, finally, and a bunch of stuff to go along with that. The multiple MDSs have a load-balancing framework that's all heuristic-based; it tries to understand your workload and move things around, but it's hard to understand what client workloads are doing. So there's also a manual mechanism: you can just go in and say, this subtree, this directory, I'm pinning to that MDS. If you want, you can manually enforce whatever subtree partition you like, rather than relying on the automatic thing, which might do it right, might do it wrong; that's ongoing work. (There's a small sketch of the setup below.) Directory fragmentation is finally on by default. This is the machinery for dealing with very large directories: it breaks them up into little pieces, puts them in separate objects, spreads them across multiple MDSs, all that. It was off by default for a long time just because we didn't have the test coverage; it's finally on, so that's good. So many tests have been written and so many bugs fixed; the CephFS team has really been kicking butt, getting confidence in the stability and so forth. And a lot of work is going on on the kernel client side too, keeping the kernel client up to date with all the changes in userspace, fixing bugs, and so on. It's a group effort, mostly from Red Hat and SUSE developers, but it's been good. So we're really excited about CephFS. And if you saw the user survey, you might have seen that the Manila CephFS driver is hugely popular, number one, which is pretty cool. Yay open source.
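A minimal sketch of the multi-MDS and pinning workflow. The file system name and mount point are made up, and in Luminous you have to explicitly allow multiple active MDSs first:

    # Allow and request two active MDS ranks.
    ceph fs set cephfs allow_multimds true
    ceph fs set cephfs max_mds 2

    # Pin a directory subtree to MDS rank 1 via an extended attribute,
    # overriding the automatic balancer for that subtree.
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects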
So that's mostly it for what's coming in Luminous. We're wrapping up the development cycle; there are a couple of last features being finalized and merged into the tree, and in the meantime we're also focusing on the low-hanging fruit for usability, making things easier to manage and less confusing, that we can still squeeze into the release. You might have noticed there's a bit of a theme of trying to make Ceph less hard, and we're trying to get as much of that into Luminous as we can. But after Luminous there's more stuff, so what's coming next?

The next release is going to be Mimic. It's named after the mimic octopus; I strongly encourage all of you to look up the mimic octopus on YouTube. They're super amazing, definitely the coolest cephalopod I've googled. Anyway: Mimic, lots of stuff again. The main priorities are, on one hand, making Ceph faster; feature-wise I feel like we're in a pretty good position, and the main challenge a lot of Ceph users, OpenStack or not, are facing is around usability. It's just hard to manage, it's complicated, it's hard to set up, so we're trying to make it easier and less confusing. There's a lot of stuff that's just needlessly obtuse and opaque, and we're trying to improve that. But performance is a big thing too.

On the RADOS side, the biggest thing happening post-Luminous is a big refactor, cleanup, and optimization exercise we have planned. There's some peering stuff to fix up, but the main thing is the main IO path, where messages come in, get processed by the thread pool, and get handed off to the ObjectStore: it needs to be refactored, cleaned up, and restructured around a more asynchronous, state-driven programming model, probably using futures and all those fancy language features. It's going to be painful, but it's really important, because the current structure of the code has gotten so complicated that it's hard to maintain, and it doesn't perform as well as it needs to. As our storage devices get faster and faster, we really need to address this elephant in the room to make progress. There's ongoing work in BlueFS and RocksDB too, some tactical items we're dealing with there, but really we're limited by that OSD piece. We've done a lot of optimization on the messenger side (you saw the talks earlier, with RDMA and so on), so we're getting much faster at moving data into and out of Ceph, and BlueStore is much better at getting data on and off disk. We've eliminated the main issues at either end; it's everything in between that needs fixing. So that's the big thing that's going to keep at least some piece of our team pretty busy.

There's some other exciting RADOS stuff coming, too. One effort that's been going on for quite a while now, as a secondary priority, is quality of service. There's ongoing development around the dmclock algorithm, published several years ago at an academic conference. It's distributed quality of service that gives you two things: minimum reservations for clients or classes of clients, and weight-based prioritization for everything above that. So you can guarantee so many IOPS to certain clients, and whatever's left over is shared proportionally, by weight, among the other clients. The idea is to have a range of policies: we can use this to prioritize types of traffic, like client IO versus recovery IO; to prioritize pools, so certain pools are faster than other pools on the same OSD; or to do it per client, so this client gets a minimum reservation and that one gets whatever's left over. The problem is that it's just a complicated problem, especially in distributed systems with replication, where serving one client's IO means signing up to do IO on other nodes too. Despite that, our initial testing has actually shown pretty good results, so we're encouraged that, even without a complete solution, it actually seems to work pretty well. The main thing missing right now is any kind of management framework: a lot of the underlying queuing is done, but how it gets configured and what the user experience looks like is all still TBD. The initial results are promising, though.
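For the curious, the experimental mclock-based op queue can already be toggled on in Luminous-era code. This is a hedged sketch; the option was explicitly experimental at the time, and names and defaults may differ by release:

    # Switch the OSD op queue to the experimental mclock implementation
    # (the class-of-op flavor), then restart the OSDs to pick it up.
    printf '[osd]\nosd op queue = mclock_opclass\n' >> /etc/ceph/ceph.conf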
This is an example of a test run from a month or two back, where a couple of clients have minimum reservations, 50 and 100 IOPS, and a third has a very high priority weight, so everything left over beyond the reservations ends up going to the third client and not the first two. And you can see it actually does what it's supposed to do. So that's exciting; we're getting pieces merged and it's coming together, but it'll probably be a couple of releases before it's a complete, usable thing.

The other thing going on is more work in the tiering department. Once upon a time we did this thing called cache tiering, and it worked okay but not great, so we sort of stopped talking about it and doing much with it. The new tiering work is based on some pretty simple primitives. The basic idea is the concept of a redirect: you can have a RADOS object that's basically a symlink to another RADOS object, but from the client's perspective you don't know; you just talk to the OSD, and it proxies through to wherever the data is. So instead of the cache-tiering model, where you have a sparse set of objects that may or may not be present in the cache tier, and on a miss you fall through to the base tier, you move to a model where you go straight to the base tier, which is essentially an index: it knows what all the objects are, and each one is either right there or there's a pointer to where it is, in one slow pool or a different slow pool.

It enables other things too. Deduplication is a project the folks at SK have been working on for a while, and we've been helping out a little; it's built on this same redirect concept. The idea is to generalize the pointer-to-somewhere-else into a manifest that says, this part of the object is over in that piece, and that part is over in this piece, so you can have fragments stored in other pools. We break objects into chunks, store the chunks in content-addressable pools (you hash the content, so you're deduplicating based on content), reference-count the chunks, and have manifests that point at a bunch of different chunks. It's the basic idea behind how all deduplicating storage systems work, with chunking and so on, but it's scale-out, in the sense that the base RADOS tier acts as the index: given an object name, you look it up in that pool and it tells you what the chunks are and where they're stored. That's the basic idea and the direction we're going. A lot is still to be determined: is chunking inline or a post-process? Does it happen inside the OSD or in an external agent? What policies control when you chunk and when you don't? Those are all TBD; right now we're just getting the core underlying functionality in place.
There's lots more work going into the Ceph Manager. It's being built up as the place to put new management functionality in Ceph, and there's a lot we want it to do. One of the first things is metrics aggregation: it's already slurping up all the performance counters from all the daemons, but they're not really going anywhere yet. In the short term, we want the manager to provide time-series data in memory, out of the box, with a short history, the last few minutes or whatever, so you can get a little IOPS graph without any additional work. Eventually we'll want to stream that off to external platforms, so if Prometheus or Zabbix or whatever is your big thing, you can just turn on the firehose and send it all there. It's even possible we'll get the Prometheus piece in for Luminous; maybe, I don't know, we'll see, depends on how fast they are. (There's a speculative sketch of that below.)

The broader intent is for the manager to be a good host for other management functions. It has this whole Python runtime environment, so you can add Python plugins and add new functionality pretty easily. It's going to be where we do the automatic CRUSH optimization, automatically balancing your CRUSH weights out of the box. But we can also do other things, like automatically identify which devices in your cluster are slow and migrate workload away from them, steering IO away from those devices, or even do device failure prediction: if we're pulling SMART data off the disks, we can run a prediction model that decides which ones are about to fail and preemptively copy data off of them. So lots to do there, and it's kind of exciting, because writing policy-based things in Python brings in a whole new pool of contributors who can write that kind of code. So that's good.
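That speculative sketch: if the exporter module lands, the hope is that wiring it up is just another module toggle. The hostname is made up, and whether this ships in Luminous was still an open question at the time of the talk:

    # See which manager modules exist in your build, then enable the exporter.
    ceph mgr module ls
    ceph mgr module enable prometheus

    # Scrape the metrics endpoint (9283 is the port the module settled on).
    curl http://mgr-host.example.com:9283/metrics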
There's also stuff on the architecture front. There are ARM64 builds; I've mentioned this before, but we're still trying to get enough hardware in the lab to do these on a regular basis and get them into the CI/CD pipeline. We have some of the hardware and we're still waiting on a few more boxes, but the intent is that, going forward, all new releases will have ARM64 packages for both CentOS and Ubuntu. A bunch of patches also came in recently adding PowerPC support, and we're talking about getting PowerPC hardware into the community lab to do those builds as well. A while back we did some work with ARM32 builds, because we built this 500-node, multi-petabyte cluster out of little microservers: essentially a hard disk with an ARM server on the drive itself, speaking Ethernet, so you're running the OSDs on the hard disks, literally, with no boxes hosting them. That was pretty fun. That was with WD Labs, and they're actually doing an update to their platform: that was a gen-2 drive, and the gen-3 drive has a 64-bit ARM core, more RAM, better networking, all kinds of stuff. We're working with them; it's exciting. If you're interested in that project, or these things seem interesting to you, you should contact Jim Walsher at WDC; they're looking for POC partners to work with.

And then, finally, client caching, sort of across the board. On the RADOS Gateway side there's a project with Boston University (and Intel, I think, worked on it) where they added a persistent cache to the RADOS Gateway to support their big-data workloads over RGW. It worked great: they're putting the cache on NVMe, pretty much saturating the NVMe, getting really good performance, and it didn't sacrifice consistency, because of the way it's architected: they cache immutable objects only. A couple of the students who worked on that are now interns for the summer at Red Hat, so we're planning to get all that code cleaned up and merged into the tree, which is exciting. On the RBD front we're also very interested in client-side caching. If you saw the talk earlier with Jason and Tushar, we're looking both at immutable caching, so that if you have snapshots that are the basis for clones, the immutable parent can be cached, and also at a write-back cache, so you get low-latency writes that then get streamed back to the cluster. CephFS actually already has a persistent client-side cache if you're using the kernel client; it's been there for a while. There's a generic kernel facility called FS-Cache that CephFS plugs into, so on the CephFS side you already have a client cache, at least for reads; it doesn't handle written data. So yes, client caches are good.

And that's sort of it for my whirlwind tour of all the new development stuff. Let me talk a little bit about all the people who are helping us do it. These graphs are a little bit old; I didn't get them updated, unfortunately. But lots of people are contributing to Ceph, and the number of contributors is increasing, which is great; we love it. It's a challenge for us to keep up with all the pull requests and reviews, so apologies if you've submitted a pull request and you feel ignored: just keep pinging us. We're busy, but we want you to keep doing it. And the community is broad and expanding. These are the top contributors, updated a bit since my last talk, and you'll see there's a broad set of people here. You can see the OpenStack vendors on this list, along with the Linux distributors, EasyStack, UnitedStack; a whole bunch of cloud operators, a bunch of them in APAC and also in Europe, public clouds and private clouds across the board, not all of them using OpenStack, although I think most of them are. You also see hardware and solution vendors selling software products based on Ceph, or in some cases hardware products based on Ceph, which is very exciting; some names, in fact, that you don't usually see on these lists, and a couple whose business I honestly don't know; I guess I could have googled it. But it's exciting to see the breadth of contribution.

There are lots of ways to get involved. There are the mailing lists. We do a Ceph Developer Monthly: every month we have a developer video call, alternating between APAC-friendly and EMEA-friendly times, and we talk about whatever development issues are pending; it's all virtual, video, IRC, whatever, so you can join. And if you want more events like this, of course, there are Ceph Days. This one's awfully convenient because it's at the OpenStack Summit, but we do them all across the world, about once a month, and you can go see the schedule; the next few are in Asia over the coming months. There are meetups you can search for in various locales. We also do Ceph Tech Talks, I think the last Thursday of every month or something: there's always an online YouTube/BlueJeans session where somebody gives a technical presentation on some subject related to Ceph. Those tend to be pretty interesting. All of this is recorded and ends up on the Ceph YouTube channel.
So if you go there, there are hundreds of recorded talks from the past many years that you can watch to learn all kinds of good stuff. And of course you can follow us on Twitter. Yay. And that's it; don't forget to google the mimic octopus, because it is awesome. I'm happy to take any questions: on Luminous, on what's coming after, on old stuff, on cephalopods. Do use the microphones, though, because they're trying to record.

Q: Hi, yes, I have one quick question. We're really happy to see the work on QoS coming. Does that also cover per-client metrics in the manager?

A: Yes, that is the intention. The way dmclock works: the mclock piece is the actual prioritized queue that does the weighting, both minimum reservations and weighted sharing, and the "d" part is the distributed part. There's metadata shared between clients and OSDs, piggybacked on the IOs, so you get a global reservation and not just a local one. The intention is that you can tag a client with a reservation, and even though it's talking to lots of devices, it reaches that goal. But again, what's missing is how you configure it and where the policies are stored; we have to figure that out. We're interested in both sides: being able to limit people, and also being able to find the people who are abusing the system. Just measuring usage is one of the things, I should have mentioned, that we want to do in the manager: have all the OSDs in the system sample their request streams and send that information back to the manager, so it can build basically a "top"-like view of who's doing all the IO in the system.

Q: Okay, thanks. Do you have any plans for CPU optimization? Because right now Ceph itself occasionally consumes more CPU than the disks.

A: Yes, we have many plans to optimize CPU utilization. That's largely one of the main things the OSD refactor is looking to address, but across the board we're also doing a lot of profiling, trying to figure out where we're wasting CPU in data structures and whatever else that just shouldn't be done the way it's done. If you like optimizing and profiling, we'd love to have you involved. But yes, there's lots of work going on there; that's a known issue, and especially as the devices get faster and faster, we have to reduce the amount of CPU we're using.

Q: Are there any planned improvements for cross-region RGW?

A: So, in Jewel, the multi-site federation for RGW was almost completely rewritten: there's a whole new way to configure zones and zone groups, replication, bidirectional and active-active stuff. There are ongoing improvements to that, bug fixes, but there's no new feature per se in Luminous except for the metadata indexing. It has changed a lot since Hammer and Firefly, though, so if that was the last time you looked at it, it's newer and better and more robust and all that.

Q: Two brief questions. I was wondering if there's a way we'll be able to detect the list, or the versions, of all the clients connected to a cluster, because it's a bit hard when you're doing upgrades if you forget one of your clients.

A: Yep. I mentioned usability a couple of times; we're building a Trello board with all the annoying things, and that was one of the cards I added just the other day. I think it's a simple enough thing that it's going to make it into Luminous, so that we can also gate changing that minimum-required-client setting on whether any older clients are connected: it'll prevent you from saying "require Luminous" if you have Hammer clients talking to the cluster, or something.
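For reference, the command that grew out of this idea, hedged here since at the time of the talk it was still just a card on the board, reports connected clients grouped by release and feature bits:

    # Summarize connected clients by release/feature bits before raising
    # require-min-compat-client; output shape is approximate.
    ceph features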
Q: Yeah, and then the other thing is: is there any work around figuring out if some of your OSDs have, for example, high latency? Because usually those will cause a lot of issues in a cluster, and they're hard to find.

A: Yes. The OSDs already report their average latency metric to the monitor, and there's already a command, "ceph osd perf", that you can pipe to "sort -k2" or whatever, and you can see it. But it's annoying; you have to go do it. So one of the ideas is that the manager, now that it has all those metrics, can do that automatically: you can write easier-to-understand code in Python that just looks for the slowest OSDs and then applies some policy around that, like setting the primary affinity on those devices so they're not primaries and reads go to other devices, mitigating the problem; or preemptively fail them; or whatever it is you want to do.

Q: Yes, thank you.

A: Yeah. Any other questions? All right, thank you very much.