This talk will show you how to debug boot issues, mainly on your laptop, and present some simple techniques to fix them. It's beginner level, so we won't go very in-depth. So this is Kyle Walker from Red Hat, Raleigh. Yep, I'm Kyle Walker. I'm a principal software maintenance engineer, and I work out of the Raleigh office, mostly remote, but I do go into the office from time to time. And I'm Renaud Métrich, working from France, but reporting to a team in the Czech Republic. So just before we start, I will explain what our job as software maintenance engineers at Red Hat is. If there are still students here, we hope to hire some of you. In a nutshell, it's the best job in the world. So how is it different from engineering? What we do is support. We support frontline engineers, who are the people that take the calls, mainly, and try to solve customer issues with what they know from the customer. And those folks call for help when they get stuck. That's what we call collaboration. What we do, basically, is root cause analysis of issues based on system crashes or application crashes, checking log files, things like that, trying to understand. Our first role is to help the customer. So we try to provide a workaround so that their production can continue. And in parallel, once we have a workaround, we try to find the root cause, and we try to reproduce the issue so that engineering can fix it. And of course, we cannot tell engineering, you need a 10-node cluster with this running, this load, et cetera. So we try to give the simplest reproducer that we can. And finally, we create knowledge base articles that are visible to everybody or to customers — usually it's only for customers of Red Hat — where we explain the workaround and so on. Optionally, you can do more. You can propose your own fixes upstream or to the developers at Red Hat. And you can also work on upstream projects that you like.
So you are very free to do almost whatever you want. Yeah, very much so. And it's very interesting and rewarding work, getting out there and seeing, after something's been developed, packaged, tested, and shipped, all of the infinite ways that the average user will break it. And from that moment, they sometimes hand over amazing descriptions of deep-dive technical information: exactly where it's failing, why it's failing, what happened before, and what they did as a result. And sometimes the problem description is, "it's broken." And so this job is really about working with frontline to communicate with the customer, or working with the customer directly, but getting ourselves into a position where we know exactly what's going wrong. And whether we can fix that immediately, fix it downstream, or fix it upstream, wherever the case may be, that's the job. And it's a lot of fun. So a usual career path is you can get hired as a frontline engineer — that's the person that answers the calls and does the basic investigation. There are five levels for each: associate, normal, senior, principal, and senior principal. The frontline engineers we call TSEs. And then you also have the backlines with the same grading. And there are, of course, bridges between the two, and also bridges to development or to other jobs. All over the place from there. So let's start with the agenda. There are two parts here. We'll show you how the boot works, the various phases. And then we'll show you how to debug issues with service initialization. Everything is based on systemd here, of course. So, the boot phases — I think you can go. Yeah, so just to get started, just for audience awareness: this is vastly simplified. There are gremlins and deep dark magic in every bit of this. But just to start off, a lot of times we talk about the boot as the boot process that we see on the screen. But that's actually after a lot of things have happened.
CPU initialization, BIOS execution, UEFI — basically the hardware platform underneath us initializing — that all happens first. And in debugging boot problems, it's really important to be aware of that. Sometimes when you're looking at a situation where it looks like there's no software running whatsoever, there really is no software running whatsoever. It could be hardware, it could be microcode, could be firmware, could still be user configuration, but on the hardware platform — something that we can't interact with from the software side. So we're not going to get into that too much, just because it's interesting stuff but very long-winded. Where we start is really around the bootloader: talking about shim and the interaction with the rest of UEFI boot, GRUB, kernel selection. We go on from there to the initramfs operations, why we need it, and then the boot process, both pre-switch-root and post-switch-root. And then a little bit of a sightseeing tour around various problems that we see all the time in support and how we tackle them. So yeah, to start off, there are a couple of fundamental differences at the outset when you're working with the boot process. And it kind of boils down to these two: BIOS versus UEFI on the Intel platforms. BIOS is really simple in a lot of ways. Usually the end user picks a device to boot from, from their hardware configuration platform UI. And then what the BIOS does is go to the first sector of that disk and say, I'm going to read the first 512 bytes and then I'm going to execute them. And that's it. Everything else from that moment on is GRUB. And it's because GRUB puts a little bit of itself — enough to get to the rest — in that first sector. Yeah, it reads the master boot record, executes GRUB out of that first piece, and loads the rest. And GRUB does all of the hardware-specific input-output operations at that point. So at that point, we have a fairly functional system, really. UEFI is a little bit different.
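The BIOS behavior described above — read the first sector, exactly 512 bytes, and execute it — can be sketched with ordinary tools. This is a toy illustration only, not a real boot: `disk.img` is a scratch file standing in for a device like /dev/sda.

```shell
# Build a toy 1 MiB "disk" with a valid MBR boot signature (0x55 0xAA
# at offset 510), then read its first sector the way a legacy BIOS does.
truncate -s 1M disk.img
printf '\125\252' | dd of=disk.img bs=1 seek=510 conv=notrunc status=none

# BIOS behavior in one line: load exactly the first 512 bytes.
dd if=disk.img of=mbr.bin bs=512 count=1 status=none

# The last two bytes of a valid MBR are 55 aa.
tail -c 2 mbr.bin | od -An -tx1
```

On a real system, those first 446 bytes are GRUB's first-stage code, which is just enough to go load the rest of GRUB.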
The UEFI implementation is a lot more full-featured. It understands the EFI system partition, the FAT file system type. It looks for and then loads the shim — the shim being the piece that is signed for secure boot; for non-secure boot, it just falls back to the next UEFI stub image. But in the secure boot case, we have the shim there so that we can validate the boot all the way along. A lot of detail there, but basically what it boils down to is: if you wanted to release a new version of GRUB and you didn't have the shim there, you'd have to go somewhere else to get it signed. With the shim in place, it's trusted from the primary key — it's validated, signed — and then in shim there's another downstream key that we sign subsequent GRUB iterations with. That way, when we get new updates, we don't have to go through the whole signing operation all over again; we can do it in our build chain. Shim then executes GRUB, but there are also a couple of other interesting bits there. For anyone who's familiar with kernel module secure boot operations, shim is actually where the last piece that often gets missed falls into place, because when you create a kernel module and you sign it, you have to have a way of trusting that key, and shim is the thing that does that for you. It installs it into the Machine Owner Key (MOK) ring, all that stuff — shim is where all of that happens, where all the trust starts kicking off. And then again, GRUB relies on UEFI functions to operate. All right. Yeah, okay. So the bootloader — why do you need it? It's just there to select the kernel you want to boot and the associated parameters: initramfs, boot device, console, et cetera. GRUB, to be able to do that — to execute the kernel with its parameters — presents you with a menu, which is well known, and that menu is collected from /boot, where it's stored. And typically GRUB knows some file systems.
So it must know ext4, XFS, et cetera, to be able to read /boot. But it doesn't know everything. For example, multipath is not needed, because GRUB will just read from one path, and it's at the BIOS or UEFI level that you choose which path you will boot from. And typically when something goes wrong, you get this nice picture: a bare prompt and nothing else. And that's where most people panic, especially end users who aren't very familiar with GRUB operations. GRUB is actually a really decent interpreter — a CLI, a runtime environment. You can do an awful lot from GRUB, as demonstrated. So this is — this has been recorded; it's not a live demo. Basically, we are at the GRUB prompt. From there, you are not lost, because first there's help: you can just type help, and there are a lot of commands that should help you bring up the system. Typically, you have cat, which enables you to read a file and check its content. You have tab completion, so you don't need to know the exact path — just hit Tab. And that's quite simple. You have some nice commands like ls, et cetera, to see what's on your system. And here, typically, we didn't boot, and we check why: we were looking for grub.cfg, and in fact it's an empty file. That's why it was not booting. And thanks to, for example, the configfile command, you can load an alternate GRUB config file or specify the parameters manually. But here, we had the original file that we could use. And from there, you get the GRUB menu. So the root cause being the config file was renamed. Yeah, exactly — and replaced by an empty one. It was empty, yeah. So this is taken from a real example: having an empty GRUB config file typically happens when you get a hardware reset just while performing an update. On a physical system, if you have a hardware reset, the file doesn't reach stable storage and you lose it. On a VM, it's the same if you just kill your VM. So that can happen.
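A session like the recorded demo just described might look roughly like this at the GRUB prompt — the device names and the backup file name are hypothetical, just to show the shape of the investigation:

```
grub> ls
(hd0) (hd0,msdos1) (hd0,msdos2)
grub> ls (hd0,msdos1)/
grub.cfg grub.cfg.bak vmlinuz-5.14.0 initramfs-5.14.0.img
grub> cat (hd0,msdos1)/grub.cfg
grub>                                  # no output: the file is empty
grub> configfile (hd0,msdos1)/grub.cfg.bak
```

After the configfile command loads a usable config, the normal GRUB menu comes up and you can boot.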
And it also happens in BIOS mode when you play with the /boot partition. Say you reformat your disk because you want to increase the size of your /boot partition or move it. Then, because part of GRUB is stored internally near the partition table, you can get that kind of havoc. So how to fix this? Well, you can fix it from the GRUB prompt, but that's really for experts, I would say. So the best thing to do is just boot the DVD in rescue mode, or a USB stick with rescue mode. And there you have all the commands necessary. For UEFI, typically you will use efibootmgr. And for BIOS, you will use grub2-install, or grub2-mkconfig to rebuild the GRUB menu. So say now you have your menu. What happens next? You select your boot entry, and GRUB will load the kernel and the initramfs into memory, and then it will execute the kernel. And off you go. Yeah, and from here we can see the next part that usually fails. In this example, we have the kernel trying to mount the initramfs as its root device from memory. At this point, the kernel image is looking for the root file system and it can't find it. So we have a demo for that. We just boot the system. Come on. Okay, here there was some indication, but usually you don't see it because it doesn't pause. So your kernel boots, it tries to load your initramfs as its root device, and it just fails. You get a nice kernel panic at the end. Yeah, you get a kernel panic with a really interesting message: VFS: unable to mount root fs on unknown-block. It actually couldn't find the root at all. All right, from there we get into the initramfs. What is it? Why do we need it? A lot of history there, but basically the intent is: you get this really classic chicken-or-the-egg problem with really complicated systems. The more complicated and feature-rich the end runtime environment is, the harder it is to actually get it online.
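From a rescue environment, the repair steps mentioned here usually look something like the following — a sketch, assuming the installed system got mounted under /mnt/sysimage as anaconda's rescue mode does on RHEL/Fedora, and with example device names:

```shell
chroot /mnt/sysimage

# BIOS machine: rewrite the boot-sector code and regenerate the menu.
grub2-install /dev/sda
grub2-mkconfig -o /boot/grub2/grub.cfg

# UEFI machine: inspect/repair the firmware boot entries instead,
# and regenerate the config (the config path varies by release).
efibootmgr -v
```

Run in this order, that restores both the embedded first-stage code and the grub.cfg menu that was lost.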
So what you end up with is a need for an initramfs, which is basically a pared-down version of the running system that can be made available to the initial kernel as it initializes. You get a lot of things: kernel modules, configuration, systemd actually, in the pre-switch-root operation. You get a bunch of different things in order to basically bootstrap that system into a functional enough state to get access to the rest of the box. So yeah, it is basically responsible for getting the root file system mounted at /sysroot. It knows what to do from the kernel command line arguments. So in a lot of cases, after you install, you'll have this root= directive, which tells it where to look, and then you'll have a couple of other funny directives. The ones off the end are actually parsed by dracut, and they actually delay things. As the system is coming up, modules are being loaded, storage is being initialized — not all the storage. But as things are coming up, you actually get these race conditions where the system is kind of ready, but it's not ready for the next step. So in this case, what we actually say is: the root is on volume group rhel, logical volume root, read-only, just for the pre-switch-root. And then you have indicated that the LV rhel/root is one we need to wait for. If that's not there, sometimes it will boot and sometimes it won't. That's a really ugly problem to have. This is another ugly problem that you can run into. In this instance, every once in a while, you rename things, you move logical volumes around, and you get into this state a long time later, where it's saying that these particular logical volumes don't exist. It's the dracut timeout behavior. I think we have a demo for that too, right? Yeah, exactly. So let's just play the demo. Well, that works. Yeah. So the system is booting, but before switching root, it discovers some devices, and then dracut waits for the root device to come up. Of course.
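The directives being described typically appear on the kernel command line like this — the volume group and LV names shown are the RHEL installer defaults; yours may differ:

```
root=/dev/mapper/rhel-root ro rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap
```

root= tells the initramfs where the root file system lives, ro mounts it read-only for the pre-switch-root phase, and each rd.lvm.lv= entry tells dracut a logical volume it must wait for before proceeding.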
It takes up to 180 seconds — dracut waits up to 180 seconds. So typically after some time it will drop into a prompt, a kind of emergency prompt. And from there, at least on RHEL — I'm not sure how it is on Fedora; I believe it's the same — you have the sosreport command and things like that. So you have a nice prompt, and from there — yeah, until the GIF loops. Exactly. And from that prompt, you will be able to check which device is missing and what that device means: whether it's the root device, or the swap device, or a device that is mandatory because you added it to your kernel command line. Basically the debugging process boils down to: I know what the config says to look for — is that there? That's pretty much it. If it's not there, then the search continues: where did it go? Is it a simple configuration file alteration that went awry? Maybe we named it something other than what we put into the config file. Or in some cases, especially for more exotic boot operations — boot over iSCSI and things like that — it might actually be gone. There's no network connectivity or no SAN connectivity. There are a lot of conditions where that is the next thing you want to check. So yeah, in a lot of cases the simplest step is: when you're in that emergency prompt, take a look at those messages that dracut spewed out and start looking for where those devices are, where the file systems are, what logical volumes you currently have access to, things like that. So a typical case where you will hit this is: you renamed your root file system's logical volume — or better said, volume group — because you decided the name was not nice. You rename it from foo to bar, but you didn't update GRUB. In that case, the kernel parameter is still wrong. So dracut will just wait for the device, and since it has the old name, it won't come up. Another case is when you are installing a system through anaconda.
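At the dracut emergency shell, the triage just described might look like this — which of these commands are present depends on what was built into your initramfs:

```shell
cat /proc/cmdline      # what do root= / rd.lvm.lv= say we should wait for?
blkid                  # which block devices actually appeared?
lvm vgs                # do the volume group / LV names match the cmdline?
lvm lvs
journalctl             # dracut's own messages about the timed-out wait
```

The whole game is comparing the names on the kernel command line against the devices that actually showed up.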
If you are installing through the network and you don't allocate enough memory for the installer to boot — on RHEL 7 or RHEL 8, you need 1.2 GiB of memory — if you just give one gig, it has no space to store the initramfs and the stage-two image, and because of that, you have no root device. So when installing, always check that you have enough memory. And otherwise it's what Kyle said: it's typically when you want to boot through the network, through iSCSI or NFS, and your network is down. So how to fix this? Well, you can edit the GRUB menu, if you know how, to rename your device — typically just modify it for a one-time boot, and then later you fix your /etc/default/grub. Or alternatively, if you don't know what's happening, you can again boot from the DVD in rescue mode and check what is wrong with your system's state. So let's say dracut completed. We now have /sysroot — the root device from your disk — mounted. What happens now? That's when we switch root, because up until now we've been working out of memory, out of the initramfs, through the early, early boot stages. But now we actually want to start the rest of the boot process. After that switch-root operation happens, that's where we start trying to bring all the rest of the complexity into an online state. So first thing, systemd re-executes itself. That's subtle, but it's important to know, because there are instances where the systemd that's in your initramfs is not the systemd that's on disk. It's rare, it's almost never a problem, but it is good to be aware of, because in some cases you might have subtle odd behaviors where an older version is running in the initramfs because it hasn't been rebuilt since an update of systemd. At this point it starts using the real root file system; it checks for the /usr entry in /etc/fstab and mounts it. Lukáš? Oh, that's right, yeah.
/usr is actually mounted in the initramfs, so it's one of the few holdouts. Yeah, from there it starts trying to find everything else. So we get into /etc/fstab, which brings us to the next point of executing generators. The really nifty thing about the way systemd does this: ordinarily, admins or users are familiar with creating unit files, creating configuration. Generators are geared towards dynamically creating units, dynamically creating configuration. The best example of this — or the most user-friendly example — is /etc/fstab. There is a generator that runs in early boot that reads the entries of /etc/fstab and creates systemd unit files — mount units — for each one, so that they're natively understood by systemd. And then it gets into the dependency tree, and we'll have some examples of where that can go wrong later. And then one of the best parts is that it starts bringing services online, and it does so in a dependency-based manner. Legacy init systems had this wonderful quirk — well, it was a feature — where one thing would start and finish, then the next thing would start and finish, and then the next. And if you put a sleep anywhere along the way, the next thing wouldn't start until the sleep was done. It was really frustrating, because you'd be on this wild goose chase through your service initialization scripts trying to find out why it's slow and why everything else is down. systemd doesn't do that. The units and the various configuration components that it understands are tied together, either via Requires or WantedBy configurations that we'll talk about later, or via ordering operations. What you get as an immediate benefit is a much faster boot, because you get to tie together the things that need to be tied together, and nothing else. So here, now we go to debugging systemd — basically, debugging service initialization.
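As an illustration of the generator behavior just described, take a hypothetical /etc/fstab entry (device name is an example):

```
/dev/mapper/rhel-opt  /opt  xfs  defaults  0 0
```

systemd-fstab-generator turns it into a mount unit under /run/systemd/generator, roughly of this shape (abridged sketch):

```
# /run/systemd/generator/opt.mount
[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)

[Mount]
What=/dev/mapper/rhel-opt
Where=/opt
Type=xfs
```

Note the unit name is derived from the mount path, so /opt becomes opt.mount — that's the name you'd use with systemctl.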
So for this part, we'll have three points. We'll introduce you to systemd units and targets, then we'll show you the basic tools available to debug boot issues, and finally we'll show you use cases. You can view the initramfs environment as a subset of the system — it's really just the absolutely necessary bits of the system. Yep, it runs systemd. The configuration looks a little bit different from within the initramfs, so if you did something like adding rd.break to the kernel boot parameter line and dropped into that environment, you're not going to see the same services, units and whatnot that you'll see later. But yeah, it's all there, it's still running. It's used as the primary mechanism on RHEL 7 and beyond, and modern Fedoras, to implement that pre-switch-root environment. So let's introduce units and targets a bit. First, when you are dealing with systemd, to understand all the machinery, there are wonderful man pages. Typically for units, which are the generic systemd objects, you have the systemd.unit man page, and for every type of object — for example, services, sockets — you have the corresponding man pages with everything described: examples, properties you can set, et cetera. So a unit is just a file describing a systemd object. For example, you can have sshd.service, which describes the sshd service, or rpcbind.socket, which describes the socket rpcbind will listen on, for NFS mounts, for example. And with these files there's a kind of precedence to deal with, because at the lowest level you have /usr/lib/systemd/system and the unit files there — that's what is installed by packages. If you want to do customization, then you install your own files in /etc/systemd/system, and those just override the base files, et cetera. So the top priority is what is in /etc/systemd/system, then /run/systemd/system, then /run/systemd/generator, and finally /usr/lib/systemd/system.
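The directory precedence just described boils down to a first-match search. This toy script builds a fake root so it is self-contained; real systemd resolves units the same first-directory-wins way (with a few extra paths, such as /run/systemd/transient, omitted here).

```shell
# Toy model of systemd's unit search: the first directory in the search
# order that contains the unit file wins; later copies are ignored.
root=$(mktemp -d)
mkdir -p "$root/etc/systemd/system" "$root/usr/lib/systemd/system"
echo "vendor copy"    > "$root/usr/lib/systemd/system/foo.service"
echo "admin override" > "$root/etc/systemd/system/foo.service"

find_unit() {
    for dir in /etc/systemd/system /run/systemd/system \
               /run/systemd/generator /usr/lib/systemd/system; do
        if [ -f "$root$dir/$1" ]; then
            cat "$root$dir/$1"
            return 0
        fi
    done
    return 1
}

find_unit foo.service    # prints "admin override": /etc shadows /usr/lib
```

This is exactly why dropping a modified copy of a packaged unit into /etc/systemd/system overrides the vendor one without touching the RPM-owned file.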
When systemd sees a file in one of these directories, it just ignores the same file in the directories below. Alternatively, you can define drop-ins. So a drop-in, what is it? It's just a small .conf file which enables you to customize a unit. For example, take a socket — say an httpd socket, listening usually on port 80. If you want to change this, you can create a drop-in just to make it listen on port 81, for example. It's the same hierarchy, that is, the same precedence. And this time, to create a drop-in, you create a directory like unit.type.d and then a something.conf file inside, and these are loaded in alphanumeric order. For example, if you create a drop-in for sshd, you will create it under /etc/systemd/system/sshd.service.d/ as a .conf file. You can also have drop-ins for the main configuration files. For example, for /etc/systemd/system.conf, you can create your drop-in in the corresponding system.conf.d directory. For the journal, it's the same. And there are others, of course; every systemd-compliant tool should be able to understand this kind of drop-in. And of course, there is a man page for that — it's systemd.unit, I think. I think it is. So a special kind of unit is the target. Usually, a target is somewhat similar to what you had in init: a kind of synchronization point — a runlevel, for example, like runlevel 5 in SysV init, which would bring up the graphical interface. Yeah, so a target is usually used as a synchronization point. For example, you have local-fs.target, which is a special target that just says: to reach this point, you must have all your local file systems mounted — local file systems from /etc/fstab's point of view. You can have targets that don't have files associated with them; those are the systemd special targets that are described in the systemd.special man page. So initially, when systemd starts, we told you it creates a dependency tree.
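A drop-in like the port-change example above could look like this — the unit name and port are hypothetical, and note that list-type settings such as ListenStream have to be cleared with an empty assignment before being replaced:

```
# /etc/systemd/system/httpd.socket.d/port.conf
[Socket]
ListenStream=
ListenStream=81
```

After creating it, a daemon-reload makes systemd merge the drop-in on top of the vendor unit; the original file in /usr/lib is never touched.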
To be able to do proper ordering — for example, if you want to start httpd, maybe you need the network, so you have to start the network before httpd — this is managed by keywords such as After and Before, to say: my unit must start after another unit, and systemd will manage that. Additionally, you have Requires and Wants, which are dependencies on other units. There's a difference: Requires is a hard dependency and Wants is a soft dependency. That means if your unit depends, say, on sshd, and it's a Requires, then if sshd doesn't start, your unit just dies — it really requires it. So if you didn't install sshd on your system, your unit won't start. With Wants, it's only taken into account if it's actually installed. And similarly, you can install dependencies in the other direction using RequiredBy and WantedBy. So typically, when you want your sshd service started at boot — which is not the default for some services on some systems — you enable sshd. And by enabling it, you say that the multi-user target — which means normal system behavior without the graphical interface — wants sshd, and then, because systemd will boot into that target, it will try to start sshd for you automatically. If you disable it, your service won't start. Okay, so yeah, here's just a quick example pulling that all together, and a very common configuration that we see. Here in the SSH daemon, we have an After directive for network.target — not to be confused with network-online.target — because in this case, the SSH daemon doesn't actually care if the network is online; it just cares that we're trying to bring the network online. We also have it after sshd-keygen.target. That's a target that pulls together a couple of different things to make sure that the host keys are generated. So in one line, we've been able to say that it needs the network and it also needs all the things that set up your host keys.
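The configuration being discussed is, abridged, something like this — taken from a typical sshd.service; exact contents vary by distribution and release:

```
[Unit]
Description=OpenSSH server daemon
After=network.target sshd-keygen.target
Wants=sshd-keygen.target

[Service]
ExecStart=/usr/sbin/sshd -D $OPTIONS

[Install]
WantedBy=multi-user.target
```

After= is pure ordering, Wants= is the soft pull-in of the keygen target, and WantedBy= in the [Install] section is what `systemctl enable sshd` acts on.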
Really powerful and also kind of subtle. It's good to be aware of what After and Wants mean, and also the difference between Wants and Requires, because in a lot of cases it's easy to transpose them. But if you require something and that thing fails to start for any reason, then your service — which requires it — fails to start too. So it's really good to be aware of the difference. With sshd-keygen.target, sshd just wants that target: it's going to pull that target in as part of its Wants lookup. And then there's also, in the Install section, WantedBy=multi-user.target. That's just indicating that when we try to get to multi-user.target — the special target, which is similar to runlevel 3 in legacy systems — we want to make sure that the SSH daemon is there. Or at least, that's what happens if we enable this particular unit. If you don't enable it, none of that happens. And if you disable it, the inverse — the link is just removed. For more about how the boot works in this manner — because there's a lot of subtlety in how the various services and units all tie together, and how this configuration pulls the full dependency tree together — check out the bootup(7) man page. It's fantastic. It's got ASCII diagrams and everything. It's beautiful. If you want to see customizations — we talked a little bit about all of the drop-ins, and the prioritization of which configs override other configs, and where they can live — systemd-delta is your friend. What systemd-delta does is check what the OS-provided, vendor-provided configs look like, then look at what is actually in effect in terms of the drop-ins and the overrides, see that there's a difference, and print that to the console. If you have weird behavior in a specific service, and you run systemd-delta and there's a difference for that service, that's a pretty good indication that you are on the right path. You also have systemctl cat.
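In practice that inspection is just a couple of commands:

```shell
systemd-delta                   # every override/extension vs. the vendor files
systemd-delta --type=extended   # only the drop-in extensions
systemctl cat sshd.service      # effective unit: vendor file + every drop-in, merged
```

systemctl cat also prints the path of each fragment it merged, so you can see exactly which file a surprising setting came from.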
Since all these files can exist in different places, and there are drop-ins and everything else, systemctl cat is great because it tells you what the end configuration looks like for your unit. You run systemctl cat on a unit and all of a sudden you see what the config actually looks like, because it pulled one config from the vendor-provided files and then a drop-in from /etc. It's fantastic. A really subtle but important note, because it gets lost a lot of the time: after a change to a unit or to /etc/fstab, always reload systemd — especially if you change /etc/fstab. Because of the generator behavior — how it loads the config, parses it, and turns it into unit files — if you don't reload after /etc/fstab is altered, the new state isn't active. It could become active later, implicitly via enable operations or things like that, but then you get into a weird state at runtime, and you always want to avoid that. If you make changes, always reload. A couple of other options that you have available — the really powerful, big-hammer ones — are mostly on the kernel boot parameter line, or at least those are the ones we use in support work. debug is great. It's read by a lot of things, and it puts a lot of things into debug mode. Quick note: if you have a serial console enabled, that means an awful lot of output is very likely going to the serial console, which can cause an awful lot of problems. You want to do that carefully — definitely an option, but you might have to reconfigure or tweak your serial console. Other options are systemd-specific. We have kernel boot parameter options, which are documented in the man pages: systemd.log_level=debug sets the log level to debug for systemd; systemd.log_target=kmsg sets the log target; and log_buf_len=15M sets the kernel ring buffer to 15 MiB.
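The reload rule above, in command form — the mount unit name shown assumes a hypothetical /opt entry, since mount unit names are derived from the mount path:

```shell
vi /etc/fstab                 # change or add an entry
systemctl daemon-reload       # re-run generators, re-read all unit files
systemctl start opt.mount     # now acts on the freshly generated unit
```

Without the daemon-reload, systemd keeps acting on the mount units generated from the old fstab, which is exactly the "weird state at runtime" being warned about.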
That last one actually isn't a systemd option, but it pairs nicely with the previous one, because with kmsg we're putting systemd's output into the kernel ring buffer, and the kernel ring buffer by default on most distributions is fairly small, so we want to increase it. That's where that last option comes in. And then systemd.debug_shell=1 — it's great. If you have weird problems, it's great to be able to just add this to the kernel boot parameter line, because you get a console on tty9. You can flip over and start poking around as the system is coming up. For boot analysis, you also have systemd-analyze plot. It creates a really nice SVG graph of all kinds of fun things. It gives you a lot of visibility not only into how services were started, but how they were parallelized, and maybe where you have ordering problems — an After or Before that got away from you. It will almost invariably show that right away. This was called bootchart in the past. It was — you had to rebuild to enable bootchart. Well, bootchart actually got you more information, but to enable this, this is sufficient. systemd-analyze blame and critical-chain give you the per-unit timings and the critical chain of the boot. They're really great if you have a slow-boot scenario: they start telling you, this particular service took forever to finish, and it is ordered before this particular target — I need to start inspecting that. Why is it going wrong? systemd-analyze dump gives you tons and tons of information — basically all the running state as systemd sees it. A footnote: it's usually for experts, because at that point it's emitting a lot of internal information that might give you a can't-see-the-forest-for-the-trees type of scenario when searching. Logs are great. The systemd journal starts really, really early in the boot, and it's really resilient. So if you're looking for something that went wrong and may have scrolled by really fast, check the journal.
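The boot-analysis tools mentioned here, in command form:

```shell
systemd-analyze plot > boot.svg    # SVG timeline showing parallelization
systemd-analyze blame              # units sorted by initialization time
systemd-analyze critical-chain     # the chain of waits on the critical path
systemd-analyze dump               # full internal state dump (expert reading)
```

Open boot.svg in a browser; long horizontal bars and services that serialize behind each other jump out immediately.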
A couple of flags to keep in mind: -b tells it that you want to look at this boot, not any others. And a couple of output flags are helpful for us in support work. -o short-precise gives output with a really high-resolution timestamp for when the events came through, because they can get out of order, and it all looks the same when it's resolved down to a certain decimal place. And -o verbose gives you an awful lot of information about the metadata that the journal picked up about the message. In legacy syslog, a message came through and that was pretty much it: you got the facility and you got the message. The journal also gathers other information, like where the message came from in terms of the service, the comm name, all kinds of fun stuff. So, when it comes to a service startup failing — so we are not in the boot process anymore; you try to start a service and it's just failing — you can enable full debugging just for that service using the SYSTEMD_LOG_LEVEL=debug environment variable, and then you just tell systemctl to restart your service, and you have debug output only for that service. Because if you enable it globally, you will be flooded with traces. Otherwise, if you can't find it from the messages printed by your unit — well, it's a real big hammer, but it works — you can just strace systemd. You strace systemd, telling it to follow the children using -f, and then you start your service. You will see systemd fork your service, et cetera, and you can just follow what is happening, and usually you will find some, I don't know, permission denied on files, or stuff like that. On the SELinux side, if something goes wrong, maybe your system doesn't boot — well, it boots, but there are some failures. It's good to try without being in enforcing mode. So for permissive mode, you just have to add enforcing=0 on your kernel command line, and your system should boot.
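A sketch of the two per-service techniques just described — `foo.service` is a placeholder, and whether the environment variable changes anything depends on the service using the systemd logging machinery:

```shell
# Per-service debug via a drop-in, then restart:
#   /etc/systemd/system/foo.service.d/debug.conf
#     [Service]
#     Environment=SYSTEMD_LOG_LEVEL=debug
systemctl daemon-reload
systemctl restart foo.service
journalctl -b -u foo.service

# Big hammer: strace PID 1 and follow its children while starting the unit.
strace -f -p 1 -o /tmp/pid1.strace &
systemctl start foo.service
grep -E 'EACCES|ENOENT' /tmp/pid1.strace   # hunt for permission/missing-file errors
```

Tracing PID 1 is intrusive and can be blocked by hardening settings, so treat it strictly as a last resort on a box you can afford to disturb.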
If it boots completely fine, that indicates that there's something wrong with your SELinux setup. Don't use selinux=0 on the kernel command line. Why? Because it produces a mess. Once you disable SELinux, it's really hard to turn it back on right. If you just put it into permissive mode, you're okay. If you disable it, things that you create from that moment forward don't get a security context, so when you turn SELinux back on, suddenly you have weird breakage creeping in all over the place.

So now we have four use cases. The first use case is when you have a block device timeout: typically some mount point times out at boot. By default, the device timeout is set to 90 seconds. It's tunable, of course. You can tune it on the kernel command line using systemd.default_timeout_start_sec=<value>. It's a bit long to type, honestly, and if you don't have a US keyboard you need to do the translation, because inside GRUB you are on a US key map; that's not so easy, but that's how it is. You can put any value there; zero means infinite, but don't do that, because it's very dangerous. Say you have some local mount point that doesn't come up. Local mount points are supposed to be mounted before SSH starts. If you set the timeout to zero, the system hangs there and you won't ever get a prompt to debug. So you can set it to a larger value, but don't set it to zero. Similarly, if there is only one mount point that fails and you know it's expected, because, I don't know, the device is very slow, you can set x-systemd.device-timeout=<value> in the mount options for that entry in fstab; again, don't set it to zero, because you will have the same issue. But generally, don't add _netdev blindly to fstab entries when there is something wrong with the device. It's usually wrong: _netdev has the effect of delaying your mount until the network is up, and for a local device that's not appropriate.
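For instance, a sketch of what those two settings look like; the device name and the 180-second value are just examples:

```shell
# Kernel command line: raise the global start timeout (never 0 = infinite):
#   systemd.default_timeout_start_sec=180

# /etc/fstab: per-mount timeout for one known-slow local device:
# /dev/mapper/slowvg-data  /data  xfs  defaults,x-systemd.device-timeout=180  0 0
```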
So when can you get block device timeouts? Say you renamed your logical volume group but didn't update fstab: you will have a device timeout when the system boots. If it's the root device, you would probably hit the issue already in the initramfs, but if it's some /opt, for example, you wouldn't see it in the initramfs. You can also have this issue if, say, you removed a disk but didn't update fstab; there is still a mount point on it, you didn't notice, and you will get a failure mounting the device and hence land at the emergency prompt, because it's a local file system, typically.

Yeah, so this is another instance that we see very, very frequently: a service is not starting at boot, but then once the box comes up, you start the service and it's perfectly fine. Almost invariably, you've got an odd ordering or timeout issue coming into play. So, checking for missing dependencies or ordering issues, your first stop is usually the journal. Limit it down to just the specific service; that's what the -u flag gets you. That basically means: get me all the logs from this boot for this service. Check your config, because due to the way configs can be overridden by files in /etc/systemd/system as opposed to the vendor-provided units, you might have an unexpected end configuration state because of some admin operation that happened earlier. Also check for jobs that are still being executed: because systemd is dependency-based and things tie together, the system might actually still be working on it. You can run systemctl list-jobs and see that some things are currently underway, then start inspecting exactly what's going on with them, check the logs for them, all of that fun stuff. Create a drop-in to delay the service startup if necessary. systemctl edit is a nifty little utility that creates the drop-ins for you.
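A minimal sketch of that triage flow, with `foo.service` as a placeholder for the unit that only fails at boot:

```shell
journalctl -b -u foo.service   # only this boot, only this service
systemctl cat foo.service      # effective unit file, including any drop-ins
systemctl list-jobs            # anything still queued or running?
systemctl edit foo.service     # create a drop-in in the right location
```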
So systemctl edit <service> knows the read order for systemd, and so it'll put the drop-in in the right place for you. And then check whether the service unit is symlinked to a remote location. We actually see that a lot: folks will put symlinks in place, and the unit file is actually on an NFS mount or something. It's good to be aware that most of that config needs to be locally accessible; it shouldn't really have undefined remote dependencies. So yeah, check the journal and look for "unit not found". You can get false positives, but this is an example: "Cannot add dependency job for unit ..., ignoring: Unit not found."

Sometimes you can have some random service not starting at boot, and it's nondeterministic: you boot once, one service doesn't start; you boot another time, it's a different service. Typically this happens when you have ordering cycles. You need to check in the journal whether you find something like "Found ordering cycle on ..." followed by "Job ... deleted to break ordering cycle". Then you're pretty sure it's an ordering cycle. Because systemd cannot work with ordering cycles, it will just kill one of the jobs to break the cycle, and it does that dynamically, so basically you never know which job it will kill. This was an example: usually you create ordering cycles when you remount some remote file system, here /svctools, as a local file system using a bind mount. If you do this in fstab, systemd is not aware that the bind mount is in fact also remote, because it refers to an NFS file system; it will try to pull it in very early, and that creates the ordering cycle. In such a case, that's the one place where you will use _netdev, to say: okay, this bind mount is a remote file system, so delay it. So, we'll skip the demo because we lack time; it was not really that interesting, and we can finish on that.
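A sketch of that fstab pattern; the server name and paths are example values. The bind mount of the NFS-backed directory gets _netdev, so systemd orders it after the network instead of trying to mount it as an early local file system, which avoids the cycle:

```shell
# /etc/fstab
# nfsserver:/export/svctools  /mnt/svctools  nfs   defaults       0 0
# /mnt/svctools               /svctools      none  bind,_netdev   0 0
```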
Yeah, just an example of the AVC breakage that we see quite frequently. If you have those odd AVC messages scrolling by, not just one or two as something comes online, but a lot, that's a pretty good indication that you have SELinux problems baked in. So: ausearch -m AVC -ts boot, limiting to AVC events with a timestamp in this boot iteration, will get you the logs for this boot that indicate an AVC violation; and then journalctl for this boot, grepping for "unlabeled" or "Permission denied", will start indicating where a service failed because SELinux handed back a permission denied. These usually happen because, again, SELinux was disabled at one time, especially if the denials are in the primary paths. If they're in non-standard paths, like a custom Apache web server where the paths are different, it might mean you have to start investigating file contexts and mappings in SELinux, and things like that. The fix is using restorecon, basically saying fix everything under the root file system; and even easier is just touching /.autorelabel, which tells early boot to go ahead and relabel everything, and then a reboot to get it started.

Yeah, we ate up most of our time. But yeah, any questions? Go ahead. So the question was: do we have any special advice for debugging shutdown issues? Yes. One of the first things that I recommend is taking a careful look at the ordering on startup, because the inverse is true on shutdown. Sometimes what ends up happening is that if you don't have enough of a dependency defined between two services, or an ordering defined between them, there might be an actual requirement under the hood. One that usually comes up is an NFS mount: an application is using an NFS mount, but the application is not ordered after the network or the remote file system target. On shutdown, there's the possibility that the network and NFS will be torn down before the application stops, or attempts to.
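The SELinux triage described above can be sketched like this; the grep pattern is just one reasonable way to match both symptoms:

```shell
# Find AVC denials recorded since this boot:
ausearch -m AVC -ts boot

# Look for tell-tale SELinux symptoms in this boot's journal:
journalctl -b | grep -iE 'unlabeled|permission denied'

# Fix labels everywhere under / (either form):
restorecon -Rv /        # relabel online, verbosely
touch /.autorelabel     # or: relabel everything during the next boot
reboot
```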
And then the job to stop the network fails because the device is still busy, and all of those things. So the usual recommendation is: take a careful look at the startup, find out what is needed in order to get that service up, and then just follow the reverse. If there's something missing on the teardown path, add it to the startup path. What can also be very useful in that case is to set up a persistent journal, so that on the next boot you will be able to read what happened during shutdown until very, very late. Okay, I think one more question, one quick question. Oh, we could talk after, does that work? All right, I think that's it.
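Enabling a persistent journal is a short sequence on most systemd distributions; this is a sketch, and setting Storage=persistent in journald.conf is the equivalent configuration route:

```shell
# Create the on-disk journal location with correct ownership and ACLs:
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

# After the next boot, read the previous boot's shutdown messages:
journalctl -b -1 -e
```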