Hello everyone. My name is Prashan Dhanke. I'm a RADOS engineer at IBM Canada. Today I'm going to talk about troubleshooting Ceph in a containerized environment. So let's start.

I'm going to talk about diagnosing problems in a Ceph environment and how to troubleshoot them. Next up is log collection: where to look for the logs, how to increase the debug levels for the daemons, and how to understand and analyze those logs. After that, troubleshooting Ceph OSD daemons and the Ceph Manager (MGR). The next topic is memory issues: what data we need to analyze them and how to troubleshoot them. And the last one is how to troubleshoot crashing or segfaulting Ceph daemons.

First, diagnosing problems. If you're seeing problems in a Ceph cluster, you usually start with ceph status and see how the cluster is doing. If the status shows HEALTH_WARN or HEALTH_ERR, that indicates you have problems in the Ceph environment, and you start looking at logs or other additional information to find out where the real problem is. Logs are really helpful from that perspective. In a containerized Ceph environment, all the logs go to journald, and that's where you find the logs for the daemons. The cluster log, for example, lives in the Ceph monitor daemon's journal. You can get those with the journalctl command: journalctl -u and the daemon's unit name. If you want the logs in a dedicated file, just like you used to have in a bare-metal environment, you can set the log_to_file config option for the specific Ceph daemon, and then you'll get the logs in a specific log file.

Timestamps. Timestamps are really important when you're looking at problems: you're seeing a problem, but you don't know what's really happening, and you want to track down what happened before the cluster started showing it. So you need to find the timestamp when things started going sideways and work backward through the previous events: OSDs going down or being marked down, OSDs crashing, or changes you made in the environment, like network changes.

The mons are really important when you're actually trying to access the cluster. If the mons aren't doing well, or they're not in quorum, you'll see clients hang or other I/O issues. In that case you need to figure out whether it's really a mon quorum issue or something else, like networking or a configuration change — maybe you deployed the wrong ceph.conf on the client node with the wrong mon IP addresses. If you do want to track it down to the Ceph monitors and the mons are not in quorum, enable debugging on all the monitors and trace why they're not able to elect the leader monitor.
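As a minimal sketch of those two steps (the fsid, hostname, and debug level here are placeholders), pulling one mon's logs from journald and raising its debug level might look like this:

    # fetch a specific daemon's logs from journald (cephadm unit naming)
    journalctl -u ceph-<fsid>@mon.<hostname>.service --since "1 hour ago"
    # if the mons still have quorum, raise mon debugging cluster-wide
    ceph config set mon debug_mon 20
    # without quorum, use the admin socket from inside the daemon's container:
    #   cephadm enter --name mon.<hostname>
    #   ceph daemon mon.<hostname> config set debug_mon 20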
So, slow ops. You often see slow ops in the cluster. That's usually because one of the OSDs is slow, or, if it's affecting multiple OSDs, you'll see it in the ceph status: X number of OSDs have slow requests, along with the implicated OSD numbers. If it's just a small set of OSDs, you can track it down to those specific OSDs, enable debugging, and find out whether the disk is slow, or there were networking changes on that OSD node, or it's some other issue like a firmware problem. If it's affecting multiple OSDs, you need to look more broadly: a network change, some environmental change, or a recent upgrade that might carry a bug in the Ceph core causing the slow requests.

Similarly, as I said, network stability is important. If the network is not stable, you'll see a lot of issues logged in the Ceph cluster. Firewall and iptables rules matter too: you configured them once, but sometimes you reboot a node and find that the OSDs on that node can no longer communicate with the other OSDs. Likely the iptables rules or some other settings changed on restart, so you need to look around that area.

Resource limitations. In a container environment, the Ceph daemon containers don't have any limits by default on memory or CPU, but some customers configure the Ceph containers with limited memory or CPU. If that's affecting your performance or the overall functioning of Ceph, check whether the OSDs or other Ceph daemons are running into resource limits.

You'll also see Ceph daemon crashes, whether it's an OSD, a mon, or any other daemon. Look at the ceph status for pointers, such as whether any daemons have recently crashed; the crash module has likely collected the crash information. Sometimes, just as in a bare-metal environment, you don't see a core dump; that's usually because core dump generation isn't set up properly on the environment.

Memory leak issues. It's not that often that you'll have a real memory leak; there can be different reasons. You might have tuned the cluster in a way that effectively disables the OSD memory target: if you set the autotune flag, osd_memory_target_autotune, to false, the OSD no longer sticks to the memory target that was set, so it overshoots. Also, when recovery or scrubbing is happening, the overall OSD daemon memory grows.
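As an illustration (the values are made up), checking whether the autotuner is on and what target the OSDs are holding themselves to might look like:

    # is the memory autotuner enabled, and what is the current per-OSD target?
    ceph config get osd osd_memory_target_autotune
    ceph config get osd osd_memory_target
    # pin an explicit 4 GiB target if you manage the memory budget manually
    ceph config set osd osd_memory_target 4294967296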
Performance issues can be environmental: the network, the disks, or even kernel bugs, I would say. Overall, if you benchmarked the cluster and it starts giving noticeably worse performance later on, that could be environmental. But if you're expecting SSD-class latencies, say under five milliseconds, and all your OSDs are backed by HDDs, then you simply can't get that kind of latency from that configuration.

Inconsistencies: you often see that some PG stats are inconsistent, like the number of bytes, or the raw usage in ceph df. Say the total raw usage for the cluster shows, let's assume, five petabytes, but the per-OSD used bytes add up to five and a half, so there's a 500 TB discrepancy. That's likely because the OSD, or rather BlueStore, is not really freeing the space on the disk. We'll discuss that in a later part of the slides.

This next one is an issue you often see in OpenStack environments, and even in bare-metal environments where Ceph is used by different applications. Here the public network and the cluster network are separate. The rados command, or the Ceph clients in general, can't communicate with the cluster, yet the Ceph cluster health is OK, everything looks good, and you don't see any problem in the network either. The real problem is that some of the OSDs have their front addresses on the cluster network when they're supposed to be on the public network. It's caused by a race condition between the container startup and the host networking stack, systemd and NetworkManager, and it can easily take a couple of hours to figure out what's really going wrong. You can run Ceph client commands like rados or rbd with the debug_ms logs enabled and see whether all the OSDs are responding, or just take a ceph osd dump and check whether all the OSDs have their front addresses on the public network. In this particular case, osd.2 was not responding to this particular client op: osd.2 was listening on a cluster-network IP address instead of the public one. To avoid this issue, always define public_network for the OSDs; that way, when an OSD starts, it makes sure to bind only to an IP address from the public network.

Perf data. This is useful when you're investigating high CPU consumption for Ceph daemons. You collect the perf data with perf record, and then with perf report you can see call stacks like these, which tell you where most of the time is being spent and in which call. In this one, most of the time is being spent adding key-value pairs to RocksDB.
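A minimal capture along those lines (the PID lookup and durations are illustrative, and perf has to be installed on the host):

    # sample the OSD process with call graphs for 30 seconds
    perf record -g -p "$(pgrep -x ceph-osd | head -1)" -- sleep 30
    # then inspect where the CPU time went
    perf report --stdio | head -50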
This is a typical systemd service file and the unit.run file in the containerized environment. Earlier, before Nautilus — or rather, in Nautilus — you had the podman run command in the service file itself; now everything has moved into the unit.run file for the individual OSD. The first command in unit.run (I'm not sure whether you can read it properly) just activates the OSD, and the one below it runs the actual Ceph daemon — and the same goes for the Ceph mon or MGR.

This is the process tree: you can see the process hierarchy for the Ceph daemons in the container environment. conmon, the container monitor and management tool, creates the container process — podman uses runc underneath to create it — and then the container init starts the OSD daemon under it.

Sometimes you need to change something there. Suppose the OSD is not able to communicate with the monitor and you want to debug it and see what's really happening. You might need to edit ceph.conf or unit.run, adding extra debug variables to the Ceph command in unit.run, so you can start the Ceph daemon at a higher debug level.

cephadm. How many of you are using cephadm or ceph-ansible to manage the cluster? And how many are using Rook with OpenShift? OK. Yeah. Thanks. Earlier, if you remember, before cephadm we had to follow these steps to mount BlueStore to get access to the actual disk — to export BlueFS or otherwise investigate BlueStore, if you wanted to get something out of it. Those steps are avoided now with cephadm: you just run cephadm shell with the name of the daemon to get access to the BlueStore device or disk. We usually use it for exporting BlueFS or for accessing the BlueStore contents. This slide is for exporting BlueFS, in case you want to investigate a BlueFS spillover bug or corruptions, RocksDB corruptions as well.

Similarly, with FileStore — if you used FileStore before BlueStore — you would see the PG directories right on the file system. You can use this FUSE mount for BlueStore to get the same kind of access to the PG directories and see where the objects are actually stored and what the structure inside BlueStore looks like. This is just for information: you normally never need to mount this; it's only if you want to see how things look inside BlueStore.

Exporting a PG. You sometimes see a situation where the OSD is crashing right after connecting to the mon, but BlueStore itself is consistent and isn't showing any corruption, and you need to export a PG from that particular OSD because some PGs are down or incomplete and you want to repair or recreate that PG. You can export the PG and import it into the other OSDs in that PG's acting set.
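For reference, a sketch of that export/import flow with ceph-objectstore-tool (the OSD ids, pgid, and file path are illustrative, and both OSDs have to be stopped while you do this):

    # inside the stopped source OSD's container, e.g. via: cephadm shell --name osd.2
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --pgid 1.10 --op export --file /tmp/pg1.10.export
    # then inside the destination OSD's container
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
        --op import --file /tmp/pg1.10.export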
As I discussed earlier, the core dump is really crucial when you want to investigate memory leak issues or Ceph code bugs, or, say, a BlueStore corruption, and in some cases when you want to dump the in-memory logs as well. But sometimes you'll see that the core dump is not getting generated. That's because the default core file limit is not set properly — the soft limit defaults to zero — which means the systemd-coredump service is not able to write the core file. In that case, set DefaultLimitCORE=infinity in the systemd configuration, which lifts the soft and hard limits for the core file size, and then restart systemd or reboot the node.

As I said earlier, all the daemon logs go to journald, and if you want to get hold of a specific daemon's logs, you use the journalctl command. You often get a sosreport, and all the logs are inside the same journald archive file; you have to run journalctl with the --file option pointing at the archive file path and the service name in the -u option to get that specific daemon's logs. If you're uncomfortable with journald, or you're not able to trace the logs you need through it, you can go back to the old way of dumping the daemon logs into a specific file: set log_to_file to true globally, and if you want the cluster log in a file as well, set mon_cluster_log_to_file to true. You can also disable logging to journald using these four config parameters.

The monitor config database has made it really easy to apply configuration throughout the cluster with just a single command. Earlier you had to use the ceph tell command against specific daemons, like a monitor or the OSDs, but with the monitor config database you can easily apply a setting to a whole Ceph daemon type at once. However, if the daemons are not able to communicate with the mons, the ceph config command won't help you; you have to do it through ceph.conf or through the admin socket instead. That also lets you debug the boot-up process of an OSD or a Ceph mon and see where things are going wrong there.

Just as the Ceph logs are important, the messages and kernel logs are equally important. Sometimes everything looks good from the Ceph logs' perspective, but you never know what really happened on the system when things were not going in the right direction for that Ceph node. That particular node's messages and kernel logs can give you more detail, like whether there was a hung-task event or some other issue.
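A quick, illustrative way to scan a node for that kind of event (the match strings and time window are just examples):

    # kernel ring buffer with human-readable timestamps
    dmesg -T | grep -iE 'hung task|out of memory|i/o error'
    # or the kernel journal around the incident window
    journalctl -k --since "2024-05-01 10:00" --until "2024-05-01 11:00"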
Now, different issues specific to the Ceph object storage daemon. You often see flapping OSDs, which could be because of a BlueStore corruption, or an OOM-killer event when the cluster is under recovery or scrubbing activity, or some other issue. In that case you need to track through the cluster logs what's really happening and look at the events: whether the OSD is being wrongly marked down, whether it's because of multiple down reports from the peer OSDs. If it's specific to one OSD node, you can conclude that this particular node has a problem; if it's happening throughout the cluster, it might be related to some environmental issue. In the flapping-OSD case, if it's a crash, you'll have the core dump, and the crash module captures the crash info as well, so you can see what really happened when the daemon crashed.

Next is BlueStore not freeing up disk space. Sometimes this is really a BlueStore bug, and sometimes it's something to do with the discard functionality: the disks don't support discard, or there's the case where you have the bluestore bdev_enable_discard flag set to false. You'll see this scenario in a Rook-Ceph environment as well, where the disk is backed by a VMware virtual disk file: even after deleting the data from the pools, you'll still see data that BlueStore is not really freeing. It could be related to the discard functionality.

We sometimes see disk corruptions as well: BlueStore or RocksDB corruption, or even the superblock or the label block of BlueStore. For RocksDB, the corruption might happen during RocksDB compaction. For BlueFS, it can happen because of exponential growth of the BlueFS log file. And for the BlueStore superblock, multiple Ceph containers running against a single OSD can corrupt the label block as well as the superblock — there was one bug with that superblock corruption, but it has recently been fixed.

BlueFS running out of space. If BlueStore is highly fragmented and there's no space left to allocate, the BlueFS allocations start failing. bluefs_shared_alloc_size is 64K, and if the OSD has gradually consumed all the 64K-aligned blocks, allocation fails because no free extent aligned to 64K is left. You have to get the free-dump from BlueStore and see what the allocation looks like: which blocks are in use, and whether there's any free space at that 64K allocation size. You can lower bluefs_shared_alloc_size to 16K, but you need to analyze the free-dump carefully before applying that, so that you can at least start the OSD. Then, once things look good from the PG perspective and all the PGs are active, you can actually redeploy that OSD.
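As an illustrative sketch (the OSD path follows the usual cephadm layout; adjust for your deployment, and treat the config change with the caution just described):

    # with the OSD stopped, dump BlueStore's free extents as JSON
    ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-12
    # if the dump shows usable 16K-aligned space, lower the allocation unit
    # for this one OSD; it takes effect when the OSD is restarted
    ceph config set osd.12 bluefs_shared_alloc_size 16384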
Memory leaks I'll talk about briefly on the next slide. As I said, it could be a configuration issue, but there can be a real memory leak issue as well. We haven't seen that much high CPU consumption since Nautilus, but it's still possible because of some tuning on the Ceph OSD side. For that, you need to use the perf tool, or the wall-clock profiler implemented by Mark Nelson, or a tool like OProfile, if you're aware of it.

In general, you also see unresponsive and slow ceph commands. If you're seeing this more often in a Rook-Ceph environment, it might be related to the Ceph MDS: the MGR volumes plugin is what handles the CephFS volume commands, so you might have a file system corruption — an MDS corruption, you could say — or the MDS is simply not healthy. There can be other culprits too: other Ceph MGR modules can also contribute to this overall behavior. Another module I can name is the progress module; we've seen that in the past. You might also see some PGs in unknown state; that's likely because the MGR has been down for a long time, or the OSDs are failing to send their PG stats to the MGR to keep it updated. You'll less likely see the MGR itself crash; mostly you'll see module crashes, because of bugs in the MGR modules.

Yeah, so the memory leak issues. Apart from monitoring tools — ps commands, dashboards, or Grafana — you can use the podman stats command to monitor the containers. If you see that one particular daemon container is consuming a lot of memory, you can track it down through a core dump or additional data like dump_mempools, and see whether it's really a daemon memory leak or not. With the OSDs, it might just be the workload; you need more information to track down whether it's really a memory leak. The heap profiler also helps: it gives you the tcmalloc profile, so you can see what the heap looks like and how big the heap is for a specific Ceph daemon. As I said, dump_mempools is also helpful: for the OSDs you can see the BlueFS and BlueStore pools, but if it's the MGR or mon, you need to track it down from other areas, like the anonymous buffers you can see there.

We often use the core dump as additional data — I'll quickly wrap up. You can create a core dump when the memory consumption is at its highest level. In the container you don't have the gcore command, because it needs PID namespace access for the container, but you can generate the core dump directly from the node: send a trap or abort signal to the Ceph daemon directly, and that generates the core dump. Then you can analyze it using core analyzer, which is another very good tool for analyzing the heap, or even use the strings command to investigate the core dump.
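A rough sketch of forcing that core from the host (the process match is illustrative; note that SIGABRT kills the daemon, which systemd will typically restart):

    # make the OSD process abort, producing a core via systemd-coredump
    kill -ABRT "$(pgrep -x ceph-osd | head -1)"
    # locate the resulting core and write it out for analysis
    coredumpctl list ceph-osd
    coredumpctl dump ceph-osd -o /tmp/ceph-osd.core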
This is a GDB script you can use to get the in-memory Ceph logs, for when you don't have the Ceph log files available. In Rook-Ceph, when a daemon crashes, Rook creates another OSD pod, so you lose the earlier logs and you don't know what really happened. You can still get to them from the node — you have to go to the ODF or storage node and find them under /var/log/ceph — but if you have the core dump, that's also a way to get all the in-memory logs. You just need to use this script from the Ceph source code; it's inside the script directory and its name is ceph_dump_log.py. You need to find the particular thread that has the frame with access to the global CephContext object; that has the log handle. Then you just run this GDB script function to get all the in-memory logs.

So, this is the crashing or segfaulting Ceph daemons part. Say the Ceph OSD is starting, but it's not able to boot properly and it crashes before it's actually able to talk to the Ceph monitor. Here you can see that when I tried to start this OSD, it crashed, but I don't have many details in the logs; even the podman logs don't give much detail. Fortunately, we have the core dump from that crash. What I did to debug this issue: run the cephadm shell command, as I said, to get access to that OSD's block device, and then run the command directly inside the container to start that Ceph daemon in the foreground mode. It still doesn't show what really went wrong. The call stack does say there's a PG metadata corruption, but it doesn't say which PG it's associated with. From the core dump you can actually find that out, either through the in-memory logs or by going through GDB debugging, and work out which PG is associated with it and what kind of issue it's facing. Here it's because of a corruption in the PG metadata for PG 1.10 — the PG id appears as a hex value that you need to convert to decimal to find out that it's 1.10.

Yeah, that's all from my side. I'm happy to take questions.