Okay, it's 2:20, I'll get started. So my name's Dan. Just before I start my actual talk, I wanted to promote this: at the end of the day today we're going to have a Ceph Operators BOF. So please go to that pad.ceph.com link. I don't know if it's faster to just memorize it or scan the QR code, but go there, and if you have some questions or topics you'd like to discuss, we can all hang out at the end of the day and compare notes.

But yeah, this is my talk. I'm going to talk about some lessons learned operating Ceph for science. My name's Dan, like I said. I work for a company called Clyso, previously I worked at CERN, and I'm involved upstream in the Ceph Foundation and on the Ceph Executive Council. Just a little bit more about me: OpenStack and OpenInfra Summits are my favorite conferences because this is my hometown, so it's a chance for me to come back and visit. But this is also, I think, the first place where, back in 2015, we got to present the cool work we were doing at CERN with Ceph. We presented a petabyte-scale Ceph cluster in 2015, and then we presented some HPC work we were doing at the OpenStack Summit in 2018.

This talk is actually part two of a series. We had a decade of Ceph at CERN, and I'm going through the different things we learned. The first part was at Ceph Day New York in February, and then I repeated it at Cephalocon. But that was at a time of transition and I was really busy, so I couldn't actually write the talk myself. So I had ChatGPT write the talk for me, and it did a really great job, actually. That's my new best friend. At the end, I asked ChatGPT what my key takeaways from operating Ceph at CERN for 10 years were, and it gave me a very good overview of everything that I learned. It's very accurate, actually: planning, monitoring, understanding resilience, understanding how to test Ceph. This is all very good stuff. I started to wonder whether I'm out of a job, actually, now that I'm trying to enter the business world.

But then I remembered something; here's a quick side quest. Maybe you've heard of Knoll's Law of Media Accuracy: everything you read in the newspaper, okay, maybe you haven't heard of newspapers, that's like the web before, everything you read in the news is absolutely true, except for that rare story for which you have first-hand knowledge, and then you know it's just total bullshit, right? I think the same is true for ChatGPT: everything it generates is perfect until it's something you know very deeply. So I wanted to test it on Ceph. I asked ChatGPT to generate me a ceph.conf optimized for performance, and this is GPT-4, right, you have to pay for this. Here's what it looks like. It's a properly formatted ceph.conf file, which is already impressive, I guess. But then it gets to be a bit of nonsense. I guess this is not a laser pointer, is it? First and foremost, it tells you, we want a fast messenger, please don't use anything but this. But that's the default. Sorry, I went too quickly, but this is already getting to be nonsense, right? This is the default.
osd_memory_target_autotune = true; okay, this is also the default where it matters. Then it starts getting into OSD optimizations, and it tells you some things that you should not be manipulating. It tells you to tweak some FileStore settings, which are obsolete. It gets into some OSD settings which are definitely not interesting. It tells you to complain about ops after 10 milliseconds, which is going to cause a disaster in your cluster. Some other stuff. Actually, osd_max_backfills = 2, that's not a bad one. Then it says, ooh, you'd better do some journal optimizations, which is totally obsolete, more FileStore nonsense. Then some "let's make sure we're using the best possible allocator"; this is all nonsense, you should not be changing this, it's auto-tuned anyway. And my favorite thing at the end is how to "optimize" the mon: mon_allow_pool_delete = true. Okay. There's some risk that if you put config on a presentation somewhere, someone is going to use it, so I'm putting some warnings on the slide: this is very dangerous, don't use this. So we all still have a job; I'm very happy.

The recap from part one of that talk series was that the reasons why we all choose Ceph have remained the same. It's all of these adjectives: reliable, modern, integrated, flexible, open, enterprise-quality storage for a good price. I think that's what we can agree on. And over the decade, in the first part of the talk, which you can watch on YouTube if you're interested, I talked about how we improved resiliency and performance, all of the scalability inside the OSDs but also cluster-wide, and how all the daily operations stuff improved. So operating Ceph now gets more boring, thanks to all the contributions from different authors and organizations.

That brings me to this talk, where I want to cover some very specific best practices for the different phases of a Ceph cluster: planning, implementing, operating, and debugging Ceph.

So this is just my take on how you plan for a Ceph cluster. It's very high-level stuff, but maybe some people are not familiar with the different orientations of Ceph, so this is my very high-level summary. The first question, and I think Steve Oshel pointed this out at the beginning, is to understand your main use case: are you performance oriented or capacity oriented? That's going to push you towards different flavors of hardware. Performance oriented? Always use all flash, which these days basically means NVMe, and six to eight gigabytes of RAM per OSD is the target you need to aim for. If you're capacity oriented, use large HDDs; there are different recommendations about the percentage of metadata, but generally around 150 to 200 gigabytes of NVMe flash per HDD for the RocksDB is good, and aim for four gigabytes of RAM per OSD. And, I'm glad to hear that this is now generally recommended: for CephFS or S3 use cases, do use all flash for the bucket indices or the metadata pools.

Now, buying hardware for CephFS MDSs is where it's hard to give general recommendations. The rule of thumb in CephFS is to start with a single MDS. Don't go in with the idea that you should start with multiple MDSs, because it's actually very complex to run multi-MDS. Scaling out metadata requires a lot of experience, I would say, with your workload.
You should also target something like up to 128 gigabytes as the memory target for an MDS; beyond that, I think that's when you should start looking at multiple MDSs. But when you're buying your machines, if 128 gigs is the memory target, plan for at least 50% more or even double that amount of RAM, just in case the MDS overshoots its target, which happens quite often. The other daemons are less demanding, so I think people understand what to do there.

Architecturally, if you're just starting with Ceph, you should consider early on whether to run one cluster or several. You probably have the idea to have one Ceph cluster, but maybe you should already think about multiple clusters from the beginning, because a single cluster can be a single point of failure. My friend and colleague Enrico hinted at a little failure scenario that I'll elaborate on later in my talk. It can also be a bottleneck. Why a single point of failure? There are risks during upgrades, there's the possibility of bugs, there's the possibility of infrastructure failures taking down the whole cluster, and operator error also happens quite often. Well, not that often, but it happens. And why a bottleneck? Suppose you have different use cases, S3 and CephFS and block storage and all of these different applications in the same cluster. There's no real pool-to-pool fairness in Ceph, so one of those applications can overload the cluster and impact the other users. So it's better to separate these things into different clusters. Also, if you start with one cluster and think "I'll solve it in three years when I grow", migrating data can be very painful. It's definitely not live or transparent to the end users, and it can be very labor intensive and time consuming.

The other type-one decision you have to make early on is whether you're doing replication or erasure coding, because you can't change this on the fly. You can resize replicated pools, although normally we just do three replicas and go with that. But you can't convert replicated pools to erasure-coded pools, and you also can't reshape erasure-coded pools. Some people think you can go from, say, 4+2 to 8+3; you can't do this on the fly. It would be nice if Ceph could do this transparently. There are tricks to do it, but it's generally not an easy thing to do. The usual recommendations are 3× replication, or 4+2, then 2+2, maybe 8+3 for erasure coding. Wider EC profiles can be interesting for some use cases, but performance can really start to suffer.

So now you're implementing: you've bought your hardware and you want to install Ceph. I'm not going to go into detail about installing Ceph, but basically, for everyone here using the community releases, you should just use cephadm or Rook, full stop. Anything else is not supported. I would love to support other ways to install Ceph, but these days those are the two supported options.

Once it's installed, make sure you do some burn-in validation and get to know your Ceph cluster. Start with a single rados bench on a test pool; give it plenty of PGs and set everything up so that you're actually using all of your disks.
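Just to make that concrete, it can be as simple as something like this. This is only a sketch: the pool name, PG count, and thread count are placeholders to adapt to your cluster, and it assumes you're happy to create and delete a throwaway pool.

```bash
# Throwaway pool just for burn-in; size the PG count so every OSD gets work.
ceph osd pool create bench.test 256 256 replicated
ceph osd pool application enable bench.test rbd

# Single client first: a long write phase, then read back what was written.
rados bench -p bench.test 600 write -t 16 --no-cleanup
rados bench -p bench.test 600 seq -t 16

# Clean up the benchmark objects and the pool when you're done.
rados -p bench.test cleanup
ceph osd pool delete bench.test bench.test --yes-i-really-really-mean-it
```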
You should definitely be able to saturate the client network wherever you're running that test from. Then coordinate several rados benches in parallel and make sure you can saturate the OSD hosts. I don't say exactly what I mean by saturating the OSD hosts, because maybe you'll hit some device limits first, or network limits on the OSD host, but you should be able to saturate something other than Ceph itself. If you can't, I think something is broken in your cluster, and this happens often. Also, don't just run two-minute tests and then go home and say job done; run something over the weekend, run something a week long, so you can really see that the components in your cluster are stable. And at the end of all this, make sure you delete everything and start again before you go into production.

Also, don't forget to do some kind of availability validation: practice rebooting hosts, and practice understanding what happens when you're recovering hosts or when you have multiple hosts down and PGs go unavailable. Generally, don't be shy; try to break your cluster. This is a very good way to understand how Ceph actually recovers from failures.

Now something about configuring Ceph, once you have it installed and maybe want to start playing with the configuration. It's 2023 and Ceph has been around a long time, and as the AI systems have indexed the web, and Google has indexed the web too, there are thousands of recommended tunings online, and they're almost all out of date. If you're using the latest version of Ceph, the defaults are very good, pretty close to ideal. Of course, I still have some suggestions which are not the defaults, and I'll go through them fairly quickly; too bad I don't have a pointer.

The upmap balancer should be on by default, but it doesn't do a very tight job of balancing the PGs. You can tighten that by setting the max deviation to one or two; this will squeeze the utilization to be very uniform across your cluster. The max backfills and max scrubs defaults are not high enough; they default to one, and they should be higher, especially on erasure-coded pools. osd_scrub_auto_repair is not enabled by default, and I don't really know why. If you start to get inconsistent PGs on your pools, you should look into why, and if you find it's just due to weak reads, where scrub reads the data out and sees an error, but if you write the data back the disk is just fine, then you should probably enable osd_scrub_auto_repair and save yourself some headaches. If you have large, high-capacity HDD nodes, you should use the option mon_osd_down_out_subtree_limit = host. If a whole host goes down at once, normally you want someone to go in manually and get it running again, and not backfill and drain the whole host; if that's the behavior you want from Ceph, this is the option to use.

After your pools are created and the PG counts are correct, and people in this room should still try to understand PGs, I don't like this idea that PGs are an abstract concept, try to understand the number of PGs you should have on an OSD, then consider disabling the PG autoscaler. It's not that pleasant to have it resplit or merge PGs in production when you're running a real system.
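Pulling those non-default suggestions together, this is roughly what they look like with ceph config on a recent release. Whether you want each one, and the exact values, depends on your hardware and workload, and <pool> is a placeholder.

```bash
# Tighter upmap balancing: squeeze PG counts per OSD to within 1 of each other.
ceph config set mgr mgr/balancer/upmap_max_deviation 1

# Allow more concurrent backfills and scrubs than the default of 1.
ceph config set osd osd_max_backfills 3
ceph config set osd osd_max_scrubs 2

# Auto-repair inconsistencies found during scrub (once you've confirmed they
# are transient read errors and not a failing disk).
ceph config set osd osd_scrub_auto_repair true

# Don't automatically mark out a whole host's OSDs; wait for a human instead.
ceph config set mon mon_osd_down_out_subtree_limit host

# Once the PG counts are where you want them, stop the autoscaler from
# splitting or merging behind your back.
ceph osd pool set <pool> pg_autoscale_mode off
```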
Letting it do that in production is not something I recommend, at least. There are some nice flags, nodelete and nosizechange, that you can set on pools to prevent things, especially operator error, from changing them by accident. And you can also set nopgchange to really make sure you don't start any splitting or merging.

I mentioned PGs: you can just use ceph osd df and ceph osd df tree, count the PGs per OSD, and make sure it's around 100. In the near future we'll probably start considering increasing that recommendation; you can get a lot more performance now, especially on NVMe, with more PGs. Mark Nelson, who I'm working with now at Clyso, is working on that as a general recommendation at the moment.

Some things about the MDS. This normally requires a lot more attention and tuning, and it's very workload dependent. You can start by increasing the default debug level to 2/5. Then you get a kind of heartbeat every second from the MDS: it prints the status of the cached inodes and the memory usage, and you can start to get a feeling for how the MDS is behaving under your workload. There's a nice option if your MDSs use lots of memory and then take a long time to reload all the cached inodes after a restart: mds_oft_prefetch_dirfrags = false. That lets the MDS restart very quickly without having to preload all those inodes. If you're getting overloaded MDSs and lots of warnings about cache recall and the like, there are some options named cache trim threshold and various recall options that you should look into. There's good documentation on those, although it takes a bit of experimentation; generally, just increase those numbers a little, that's what I would do. And if you have misbehaving clients and your environment allows you to evict clients, you should maybe even consider evicting clients automatically. On some clusters, 15 minutes is a good number for this: if a client doesn't release its caps for 15 minutes, you can evict it.

You also want to configure your hardware. This is my favorite thing to talk about this year: modern HDDs have a volatile write cache on board, and it's enabled by default. Here's the experience we had on a Ceph cluster at CERN. We installed some new hardware, and the latency, this is a small probe writing four-kilobyte objects to the cluster every minute, suddenly went up into tens-of-milliseconds territory, and the users were also complaining about the performance. We found that for BlueStore to be able to drive modern HDDs efficiently, you need to disable this cache. I've shown this plot before, but this is the impact of disabling that cache on the latency. This is now a general recommendation in the hardware recommendations section of the docs.ceph.com documentation; there's guidance there about how to benchmark your devices and how to know whether you want to disable that cache.

Okay, that's configuring Ceph. Now some advice, let's say, for operating Ceph. If you've planned, validated, and configured things well, then these days operating Ceph is quite boring; it should be quite uneventful. Device failures should be transparent, and replacing devices should behave as expected. All the tooling is quite good for that now.
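Going back to those MDS options for a second, here's a rough sketch of the kind of knobs I mean, assuming a recent release. The 15-minute eviction timeout and the trim threshold value are just example numbers from the discussion above, not universal advice.

```bash
# Per-second heartbeat lines in the MDS log (cached inodes, memory usage).
ceph config set mds debug_mds 2/5

# Restart quickly without prefetching the open-file-table dirfrags.
ceph config set mds mds_oft_prefetch_dirfrags false

# Example: raise the cache trim threshold a bit if you keep seeing
# cache-pressure / recall warnings.
ceph config set mds mds_cache_trim_threshold 131072

# Auto-evict clients that sit on revoked caps for 15 minutes (900 seconds).
ceph config set mds mds_cap_revoke_eviction_timeout 900
```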
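And on that write cache point, this is roughly what checking and disabling the volatile write cache on a drive looks like, along the lines of what the hardware recommendations in the Ceph docs describe. Device names are placeholders, and whether it actually helps is something to benchmark on your own drives.

```bash
# SATA drives: query and disable the volatile write cache.
hdparm -W /dev/sdX        # show the current write-cache state
hdparm -W 0 /dev/sdX      # disable it

# SAS drives: the same thing via the SCSI WCE bit
# (add --save to persist it across power cycles).
sdparm --get WCE /dev/sdX
sdparm --set WCE=0 /dev/sdX
```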
My preference, and that previous plot I mentioned was exactly this, a rados bench, is to run a little sysadmin-style probe on the side, independent of your Ceph systems, just so you have an external metric for the latency or performance of the cluster. For 10 years I've been running this on some clusters: just a 10-second write of small objects with one thread, with the result stored in a database somewhere. You can do something similar with an rbd bench or some S3 commands, if that's your application.

Also, pay attention in whatever you're using, PagerDuty or whatever monitoring system, to the health warnings; don't ignore them. If you're getting slow requests in your cluster, this is not normal. If you're getting them all the time, it means something is wrong and should be investigated. They can escalate in severity and lead to more pain, especially for CephFS. If you're getting warnings from the MDS or from CephFS clients about different types of issues, those should be investigated and understood. If you ignore them, they can lead to outages; that's happened to us in the past.

Another kind of practice, I don't know if people do this, but I'm a CLI guy, are people CLI or dashboard people? CLI? If you're making changes, don't just type "ceph osd pool set" whatever. Make sure you have a "watch ceph status" running in the corner, so you can see what happens immediately and maybe undo what you did. You don't want to make a change and only find out what happened 15 minutes later; that's kind of weird. So watch ceph status is fundamental. There are a couple of other commands like that. There's one I didn't put on the slide, but over lunch someone was asking how you monitor CephFS: on an MDS machine there's a command called ceph daemonperf. You run it the same way you run ceph daemon, which people will know is how you interact with any kind of Ceph daemon, and it gives you a top-like output from the MDS, refreshed every second. That's useful. If you'd like to dig deeper into this topic, you should probably look into the perf dump you get from ceph daemon, and into ceph report, and start instrumenting some of the metrics from there into your custom dashboards.

Here are a couple of my favorites from over the years. This set of four plots is related to OSD maps. From the ceph report output, the square on the right here, you can understand what the cluster is doing with OSD maps and how many are outstanding. OSD maps are something you don't want to accumulate: by default Ceph will keep around 750 OSD maps in the mons' database, and they're also distributed across the cluster. Whenever there's backfilling or other work going on while the cluster is degraded, they just accumulate. In the past there were scenarios where you could accumulate millions of OSD maps, which was heading toward a disaster. So it's nice to monitor that kind of information. In this case, I'm showing these four plots because you can see maps accumulating up to a few thousand, but that's because there were degraded objects and placement groups recovering or backfilling.
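If you want to instrument this kind of thing yourself, these are roughly the commands behind it. The daemon name is a placeholder, and the ceph report field names can vary a bit between releases, so check the jq path against your own output before wiring it into a dashboard.

```bash
# Live, top-like per-second view of an MDS (run on the MDS host).
ceph daemonperf mds.$(hostname -s)

# Full counter dump for the same daemon; this is where latency metrics
# such as the MDS create latency come from.
ceph daemon mds.$(hostname -s) perf dump

# Rough count of outstanding OSD map epochs held by the mons; if this keeps
# growing while the cluster is healthy, something is wrong.
ceph report 2>/dev/null | jq '.osdmap_last_committed - .osdmap_first_committed'
```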
So this is all normal, but it's something fun to monitor as you start to understand Ceph. And here's another one; this is a CephFS set of plots. The perf dump of an MDS gives you lots of latency metrics, and one nice metric you can monitor to understand the performance of an MDS is the create latency: how long it takes to create the inodes it's creating. On one of our systems we have a Grafana plot that's hard-coded not to exceed 10 milliseconds, but we noticed it was actually above a second for a while. So there was a problem where creating inodes was taking more than one second. This is a known issue related to Kubernetes clients: after recovery, if a Ceph client had a file open and that file gets deleted, it will try to reopen that inode even though it doesn't exist. And in a multi-MDS system, each MDS will say "I don't have that inode, try the other MDS", and the client just iterates through all the MDSs in a loop, but none of the MDSs ever says "you've tried them all, you can give up now". So it just keeps going in an infinite loop. That's fixed now in the latest Quincy and probably the latest Pacific, but this is what the issue was here. You can see the latency showed there was a big problem; there was high CPU usage on the MDSs, and a high number of hcr/s, the client requests to the MDS. So this also shows something is wrong. Those are some of my favorite metrics.

One of the FAQs in Ceph, if you're a community user, is when to upgrade. Of course, if you're supported by a vendor, follow their advice, but otherwise there's no simple answer. I think by now people have understood that you should not upgrade on day zero, to avoid regressions; upstream we do all the testing we can, but it still doesn't catch everything. But you also definitely don't want to lag too far behind. I think once you go more than a year behind on the major releases, you start to run into risks: if you're community supported and you need to ask for help, the majority of the community has probably forgotten the issues that happened with those old releases and their upgrades. There are always little nuances, so you should try to stay a little bit current. If you are self-supporting, definitely follow the community mailing list. There's a Slack now that's quite active, with lots of channels, which is cool, so you can dig into different topics and look for news from early adopters. Also try to test in pre-production environments and share the results with the community. Yuri is doing a lot of work to publish release candidates, and there's a release candidate out for Reef now, so if you have the opportunity to test those pre-releases and share the results, that's really valuable for everyone in the community. And also understand your rollback options: it's normally not possible to roll back, but for minor releases it sometimes can be.

The last part is about debugging: what to do when Ceph fails. It's totally rock solid, but it's not infallible; bugs do happen. It's weird that, now that we've moved, okay, it's not even that recent, to a containerized world, I have to talk about log files. I guess I'm too old.
I don't even know how to access log files properly in containerized setups, but it's essential to be able to access your logs and to know how to manipulate the debug levels of the Ceph daemons if you want to debug things yourself. So know how to access the logs, and know how to change the levels. It sounds really obvious, and I'm almost embarrassed to mention it, but when you're under pressure, when things are down, you don't want to be hunting for where the log files are and how to access them. Is it in journald or in /var/log/ceph? You'd better know that beforehand.

The default log levels are not verbose at all; if there's a problem, they won't tell you what's going on. Unless it's a major failure, in which case you'll get an abort or an assert with a backtrace. But if it's just some misbehavior, then these are good places to start. debug_ms = 1 turns on the lowest level of debugging for the messenger, and with that you can see the operations between daemons: OSD to OSD, OSD to MDS, or clients to the MDS. There you can already see, for example, if you have an overloaded MDS, you turn debug_ms on for a few seconds, turn it off, and you can see, oh, this client is statting the same file in a loop, and you learn immediately that you should evict that client. Similarly, the OSD gets interesting around debug level 10, and the MDS gets interesting around debug level 7. As I mentioned, you want to turn this on for about 10 seconds; it's very verbose, so switch it on, switch it off, and then start to have a look.

But what happens when it really fails? Very rarely, things can go very, very wrong. In the past decade I'm only really aware of, and there might have been other cases, two scenarios that were really bad, where there were no workarounds to solve the issue. In the Ceph community we started calling these the "bug of the year". The first one was just at the start of the pandemic, just before we all went to work from home. We had a major outage at CERN, and there had been rumors about this bug; there was already a bug tracker for it, but nobody understood what was going on. Then it hit us at CERN, and as Enrico mentioned, we had a cluster-wide outage: all the OSDs stopped, saying that the OSD map was corrupted. This was a disaster, right? We had the head of the department and lots of important people in the office asking how it's going, and we were kind of stressed out. We recovered the cluster because you can extract old OSD maps from the OSDs themselves, even when they're not running, or from the mons when they're not running, and you can inject them back into the backing store and get the OSDs to boot that way. That's how we managed to recover it. Being the naive guys we were, we had OSDs that wouldn't run because of CRC errors, so we just put the maps back in, and it started. But then for a week we tried to understand what had actually happened, through various levels of debugging, and you can watch the YouTube video; it's kind of interesting, I think. I shouldn't say that, since I gave the talk, but it was a fun experience.
We found that there were four flipped bits, and those were caused by a bug in LZ4 that, in an extremely rare, one-in-100-trillion kind of scenario, if you handed it non-contiguous memory, would corrupt the data it was compressing. The outcome is that Ceph is now resilient to this, and LZ4 is also fixed in the operating systems. So that was 2020.

Then just in the last year we've had another scenario, where lots of users were reporting OSDs just bloating up in memory, using hundreds of gigs. People were throwing swap at it, people were compressing memory; no one knew what was going on for several months. Even after restarting the OSDs, they would right away start up and use all that memory again. This turned out to be a bug in PG splitting and merging. The PGs keep a log of duplicate operations, the PG log dups, and there was a bug in splitting and merging that violated the ordering of those entries. Normally they're kept sorted, so when you want to trim, say, a thousand of them, you can just trim the first thousand; but because the ordering was violated, they were never trimmed anymore, so they just accumulated. The solution took a few iterations to reach a stable fix, but it's now fixed in all the latest versions, so this shouldn't happen anymore. And for anyone still suffering from this, I hope no one is, there is an offline trim command that can fix your OSDs if they won't start; and online, this doesn't happen anymore.

Okay, so that's the end of my talk, but I don't know how to end talks, so let's ask ChatGPT how to end it. I asked ChatGPT last night, tell me a joke about Ceph and OpenStack. Here's the joke: why don't Ceph and OpenStack play hide and seek together? Because every time they do, OpenStack says, "You can't hide. I always find you in my block storage." I mean, that's pretty good. It's true, right? Okay, yay.
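One last footnote for reference, since two recovery procedures came up in those war stories: both of them go through ceph-objectstore-tool against a stopped OSD. This is only a rough sketch from memory; the data path, epoch, and PG ID are placeholders, and you should check the tool's help on your release, and ideally the relevant tracker issues, before running anything like this on a broken cluster.

```bash
# Everything below runs against a *stopped* OSD
# (the unit name differs under cephadm/containerized deployments).
systemctl stop ceph-osd@11

# Bug-of-the-year #1 style recovery: extract a known-good OSD map from one OSD
# and inject it into another OSD whose copy is corrupted.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --op get-osdmap --epoch 123456 --file /tmp/osdmap.123456
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
    --op set-osdmap --epoch 123456 --file /tmp/osdmap.123456

# Bug-of-the-year #2 style recovery: trim accumulated PG log dups offline.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
    --op trim-pg-log-dups --pgid 2.7f

systemctl start ceph-osd@11
```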