My name is Renaud, I have been a software maintenance engineer at Red Hat for six years, and if you didn't guess from my accent, I'm French. As a software maintenance engineer at Red Hat, I specialize in RHEL, and I mostly deal with troubleshooting customer cases related to user space: services, systemd, yum/dnf, rsyslog, a lot of things. And usually, to troubleshoot issues, I love to use strace, and I will tell you why. But because this talk is for beginners, I will explain very simply how strace works, when strace can help you, and when it cannot. Basically, we will go through dissecting what strace outputs, and I will go through five or six example cases of where you can use strace and how to use it. And finally, we will wrap up with some SELinux integration stuff. So what is strace? The people who already follow the strace sessions know, so please be quiet. Strace is a tool that lets you see which system calls are executed by your user-space program, and prints back their return values. You can pass many, many options; below are the options I use. For example, I always use the -f flag to follow forks, that is, when a program spawns children or threads. I use -ttt to print timestamps, with seconds and microseconds, and -T to get the time spent in each syscall. I use -v because I like to have full decoding of the syscall arguments, -s to specify the buffer size for printed strings, and -yy to decode even more, file descriptors in particular. And when you want to attach to a running process, I will show you that later, you use the -p flag. So strace, I said, is all about syscalls, but what are syscalls? Basically, syscalls are the way your user process interacts with the Linux kernel. There are other ways, but I won't cover them here.
So basically, every time a user process requires access to a resource, it uses a syscall internally. And this syscall is wrapped in a glibc function, basically for error checking and other things. For example, when your program, which can be C or Python or anything, calls open, there is internally a syscall associated with that, also named open. Usually there is a simple mapping, but not always: for example, when you create a child using fork, it's not the fork syscall that is used anymore, it's the clone syscall. You can get a list of all the syscalls available on your system from the syscalls(2) man page. When I put a number in parentheses like that, it's the man page section. The list depends on the architecture. And when you want to know more about one syscall, you just use man on section 2 with the name of the syscall. So how does strace work? Strace uses the ptrace interface internally, which somehow sets a breakpoint on the process you are monitoring, which we call the tracee. Every time a syscall is entered, the tracee stops and strace gets a notification. Strace collects the data it wants to record for you for later, and tells the tracee to continue. The syscall happens in the kernel. Upon returning, the tracee stops again, strace collects the rest of the data, for example the number of bytes written to the file system, and strace prints you back, to stderr or usually to a file, some nice lines that will usually help you troubleshoot. One thing to note is that you cannot have two users of the ptrace interface on the same process at the same time. That is, if you already attached GDB to a process, you cannot strace that process. It would be nice in some cases, but you cannot do that. So when does strace help? Basically, it helps when syscalls are involved.
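To make the glibc-wrapper idea concrete, here is a minimal Python sketch that goes through glibc's generic syscall(2) entry point instead of the usual wrapper. The number 39 is SYS_getpid on x86_64 only; as noted above, syscall numbers are per-architecture, so treat that constant as an assumption.

```python
import ctypes, os

# Minimal sketch: invoke a raw syscall via glibc's syscall(2) entry
# point, bypassing the usual getpid() wrapper function.
libc = ctypes.CDLL(None, use_errno=True)
SYS_getpid = 39  # x86_64 only; see /usr/include/asm/unistd_64.h

raw = libc.syscall(SYS_getpid)   # the syscall strace would show
wrapped = os.getpid()            # the same syscall via the wrapper
print(raw == wrapped)            # True: both reach the same kernel code
```

Both paths end up at the same place in the kernel, which is exactly what strace records regardless of which wrapper the program used.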
If there is no syscall, it won't help you. So basically you have common hangs: by hang, I mean you start something and it waits forever, you don't know where, it prints nothing. A program waiting or hanging is almost the same, but usually that's more when the network is involved, for example. You can use strace, when you don't know a program, to find which files it is processing, opening, reading, writing; that can be many different things. It can also help you see which libraries the program loads. For example, if some customer set LD_LIBRARY_PATH, you will see that other libraries can potentially be opened and used, and that can make your program fail later. Also, if you know nothing about what you are trying to troubleshoot, you can perform some kind of reverse engineering of the communication between the program and the rest of the world. Last thing, you can sometimes understand what triggers some specific error, but usually you then have to check the source code and try to match what you see against it. There are more use cases, but these are the ones I mostly deal with. So, as I said, strace is of no use if there is no syscall at all. If your program is just spinning on the CPU, there is no syscall involved, so strace won't be of any use. If a syscall is not returning from the kernel, so hanging in the kernel, it won't help much: it will just tell you that this syscall, with for example this file descriptor or this file, hangs forever in the kernel. In such cases you have kernel tools such as trace-cmd that can help; it's more touchy, and it's kernel territory, basically. Strace may also not help when you have an issue related to some race condition: for example, your program has multiple threads and something is misbehaving.
Usually, stracing will hide the problem, which is a hint that you potentially have a race condition, but that's all. Of course, when a program dumps core, strace is of no use, and when a program exits just because it decided to exit due to some internal computation, strace won't help you either, because you always need syscalls to be involved. So the recommended usage, well, it's not really recommended, it's my recommended usage: when my colleagues ask customers for straces but don't want to analyze them themselves, they send them to me, and I like them to use these flags. Strace has a lot of flags, I can't tell how many; they are probably useful, but not for me. Actually, I have never read the whole strace man page in six years, because it's too long. There are too, too many possibilities. So when you want to strace a command and its children, you use -f -ttt -T -v -yy, for example. -o stores the result to a file. -s, which is 32 by default, is how many bytes of each string strace collects; 32 is too small. Usually now I use 128; here I put 1K, which is good enough. The larger you make it, the bigger the trace will be, and once you have a one-gigabyte strace file to analyze, it's not that optimal. So usually I tell my peers 1024 and it's good enough. So the first case was: I start my command under strace. The second case is: I want to attach strace to a running program. The difference is that here I'm not starting the command, I'm attaching to it with -p. You can attach to multiple processes. Just be aware that if you attach to a program that already spawned children before, of course you won't monitor those existing children.
When a process gains root privileges, for example with sudo, don't forget to run strace as root already, otherwise of course you won't see anything once the program becomes root. For me, in my job, that's easy: I always run strace as root, whatever the problem is. So now that we have the basics, let's look at the output. The output depends on the flags you use; for all these outputs, it's -f -ttt -T -v -yy, okay? First you have the PID, which is the thread ID of the process you are stracing. Then you have a timestamp with seconds and microseconds. Then you have the syscall name and the syscall arguments, and at the end of the line you have the equals sign, which means the syscall returned, and the result. The result depends on the syscall: it can be zero, it can be something else. And at the very end, you have the time spent in the syscall, which I think comes from the capital -T flag. So here, in that example, we enter the ppoll syscall, which was waiting on some file descriptors, and it returns with a timeout after 100 milliseconds. For those who use vi, you can set the filetype to strace so that you get some nice colors. That's it, so two examples here. From time to time, well, it's not from time to time, usually, you get "unfinished" and "resumed" lines. Basically, in the previous example we had the syscalls each on one line, one after the other, but that's far from always being the case. You will see this kind of thing: the PID, the timestamp, the syscall and <unfinished ...>, and later, on some other line, you get <... resumed> with the result. This happens when you have a multi-threaded program, really, or you are monitoring more than one PID. That's completely expected, but it can make analysis difficult. I think there are some tools available on the strace website to piece things back together, et cetera, but honestly I never use them. So, many syscalls return minus one. Minus one is usually an error. Is that bad?
Well, it depends. Many syscalls are designed to fail with an error. For example, when you try to access a file that doesn't exist, the syscall returns -1 and sets errno, and errno is set to ENOENT, "No such file or directory". So there are errno values that you can usually skip, because they can make you think that reading strace is difficult: EAGAIN, EINTR, ERESTARTSYS, ERESTARTNOHAND. In a way, "No such file or directory" is usually perfectly normal, except for example if you try to execute an executable and the shell searches through the various locations and doesn't find it. In that case, if you use just the command name and not the full path name, the shell will try the various places where the program could be, and in the end it may fail or not. Another example: the program tries to open some library which doesn't exist. It says "No such file or directory", but it continues, so there is likely no issue. Real issues are usually when you get EPERM or EACCES, which mean "I couldn't access a resource". Something I didn't tell you yet is that strace also catches signals, and it prints which signals were received by your program. Here, in the example above, we see that PID 6138 was doing a select, so it was waiting for something on the network, or just sleeping, and it got a SIGKILL, so it was killed, and that's it. You have other examples with SIGTERM and so on. For SIGTERM, when strace processes the signal received by the tracee, it prints you details, thanks to the siginfo structure, I think. So here it typically says that SIGTERM was received, that it came from user space, from some other process, and it prints who sent the SIGTERM: that was PID 1, so systemd, and which user it was sent as, so root, basically.
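A small Python sketch of these conventions, using the os and signal modules (whose functions map onto the syscalls strace displays): ENOENT and EAGAIN are expected failures rather than real errors, and sigwaitinfo() exposes the same sender PID (si_pid) that strace prints from the siginfo structure.

```python
import errno, os, signal

# ENOENT: opening a missing path fails with "No such file or
# directory" -- harmless when a program probes several locations.
try:
    os.open("/no/such/file", os.O_RDONLY)
except FileNotFoundError as e:
    print(e.errno == errno.ENOENT)   # True

# EAGAIN: a non-blocking read with nothing to read is designed
# to fail; it is not a real error.
r, w = os.pipe()
os.set_blocking(r, False)
try:
    os.read(r, 1)
except BlockingIOError as e:
    print(e.errno == errno.EAGAIN)   # True

# siginfo: like strace, sigwaitinfo() exposes the sender's PID
# (si_pid); here the process signals itself with SIGTERM.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
os.kill(os.getpid(), signal.SIGTERM)
info = signal.sigwaitinfo({signal.SIGTERM})
print(info.si_pid == os.getpid())    # True
```

The same three patterns show up constantly in real straces, which is why filtering out the "expected" errno values is the first reading skill to learn.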
Sometimes you get signals from the kernel. Here, for example, that's the case where the kernel killed the program because of a segmentation fault: the program tried to access address NULL, so it was a NULL pointer dereference. So strace shows you a lot of things that are very interesting for troubleshooting. So let's go through the examples; I have six of them, basically. First, stracing a command's slowness. I see that all the time on customer systems: they execute a command and it's slow. It works, but it's super slow, it takes 10 seconds. With strace it's very easy to see what's going on. In this example, we were executing df, and because the execve syscall also shows the environment variables and all that stuff, it prints the environment; I skip it, but we can see, for example, that this df process had LD_LIBRARY_PATH set in its environment to some SAP directory, which by itself is not an issue. You can check the execve man page to see how to match the fields: here we have the path name, df, then the arguments, "df -h", and then the rest is the environment, in brackets, continuing onto the next lines. This is a real example: it was taking a long time, 10 seconds, to execute df, and from the strace we could easily see that it was processing LD_LIBRARY_PATH, because the loader tried to open the libraries in various locations under that path and couldn't find them. That in itself is not a problem, because the loader will then try other paths; the issue was the time spent finding out that there was nothing there: 400 milliseconds to scan one location. At that point strace is of no use anymore, but you know where to dig: you have to check why accessing /usr/sap fails, but takes so much time to fail. For the small story, it was failing here because of some automount that was running in the background and broken.
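Since -ttt timestamps are plain epoch seconds with microseconds, spotting a 400-millisecond gap like the one in the df case is a one-line subtraction. A sketch, with invented strace-like lines rather than real output:

```python
# Each strace -ttt line starts with seconds.microseconds, so per-call
# gaps fall out of a simple diff. These two lines are made up.
lines = [
    '1680000000.100000 openat(AT_FDCWD, "/usr/sap/lib/libfoo.so", O_RDONLY) = -1 ENOENT (No such file or directory)',
    '1680000000.500000 openat(AT_FDCWD, "/usr/lib64/libfoo.so", O_RDONLY) = 3',
]
stamps = [float(line.split(None, 1)[0]) for line in lines]
gap_ms = (stamps[1] - stamps[0]) * 1000
print(f"{gap_ms:.0f} ms between the two calls")  # 400 ms, like the slow df
```

In practice you sort or diff the timestamps of a whole trace to find where the time went, which is exactly how the 400 ms per location stood out.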
I see that I don't have the complete presentation, too bad. Okay, another use case is stracing SSH being slow or hanging; this happens all the time. With SSH, you need to remember that there are two parts: you have ssh, the client, and sshd on the server you are accessing. Many, many times I see people sending me straces of ssh. But ssh as such does nothing: it just connects to the server, and it's the server that does the job. So always remember to strace the server instead. Usually I also strace the client, so that it's easy to match the timestamps and to see the connection, the port being used, and things like that. So how do you strace the server, sshd? Well, I give you a tip here. Basically, I get the PID of the SSH server, which on RHEL and Fedora is in /var/run/sshd.pid, I start strace on it, I tell the customer to do the SSH connection and then to Ctrl-C the strace once he considers that it was too slow, and then I check it. This means that strace will record all the activity of sshd: all the connections, even the ones you are not interested in. So sometimes I ask the customer instead to spawn a new instance of sshd just on a specific port, and connect to that. But I don't like that and I don't do it much, because you need no firewall in the way, an open port, SELinux configured to allow that, et cetera. Finally, we get our strace from the server and we check it. Initially we search for accept. accept is the syscall to accept a new TCP connection, and strace shows you all the details: the port on the local system and the client side, because it's our SSH server. And some lines later, we see a clone: basically, sshd forks a child to handle the connection. So search for this clone, and once you have it, you know that you are interested in this process and its children.
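The accept-then-clone pattern described above can be reproduced with a toy Python server; this is a sketch of the pattern, not sshd's actual code. Under strace -f, the two marked lines would show up as accept4(...) and clone(...) calls.

```python
import os, socket

# Toy sketch: the parent accepts the TCP connection, then clones a
# child to handle it, which is the shape sshd's strace has.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))          # port 0: pick any free port
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket()
cli.connect(("127.0.0.1", port))

conn, peer = srv.accept()           # strace: accept4(...)
pid = os.fork()                     # strace: clone(...)
if pid == 0:                        # child handles the connection
    conn.sendall(b"handled by child")
    os._exit(0)
conn.close()                        # parent keeps only the listener
print(cli.recv(64).decode())        # prints "handled by child"
os.waitpid(pid, 0)
```

This is why, in the server trace, you search for the accept on your port, then for the clone whose return value is the child PID: that PID and its descendants are the session you care about.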
But usually there is no need to check all the children. It depends where the issue is, but basically you extract from the big trace the process, here PID 23918, and all its children, and then you can dig into that. I'm giving you some commands for this, but I have better tools since then. Note that if you spawned an sshd listening on an alternate port and handling just one connection, there is no clone, because that sshd instance exists just for this one connection and handles it itself. So what do we have here? Well, that's a real example: some time later in that strace of our sshd connection, we can see that some message is sent. This is a D-Bus message, a session creation request: sshd is asking to create a session for the user, no issue with that. So it sends, and then it waits for an answer. The initial read returns EAGAIN, "I have nothing for you, because I'm in non-blocking mode", and returns immediately. And then we can see the code doing a ppoll, which is basically waiting for a notification that there is something to read on the file descriptor used for the connection to D-Bus. From there, we can see the ppoll syscall failing with a timeout after 25 seconds. Once you have all this, you are almost done: basically, there is some issue sending the message to systemd to create a session. It waits for 25 seconds and then it continues. Then of course you need to know some internals: sshd internally executes a PAM stack, and in the PAM stack there is pam_systemd, which is responsible for setting up the session, et cetera. So basically here, there was no connection to systemd-logind, which is in charge of telling systemd to create the session.
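The hang pattern above, waiting on a file descriptor that never becomes readable until the timeout fires, can be sketched in a few lines of Python; here the timeout is 0.2 seconds instead of sshd's 25.

```python
import os, select, time

# Sketch: poll() on a pipe that nothing ever writes to, so the call
# only returns when its timeout expires. strace would show a
# poll/ppoll line ending in a timeout, like the sshd/D-Bus case.
r, w = os.pipe()                    # nothing will be written to w
p = select.poll()
p.register(r, select.POLLIN)
t0 = time.monotonic()
events = p.poll(200)                # timeout in milliseconds
waited = time.monotonic() - t0
print(events == [])                 # True: no data ever arrived
print(waited >= 0.15)               # True: we really sat in the wait
```

In a trace, a long gap between the ppoll entry and its timeout return is the signature to look for: the program is healthy, it is the peer (here logind over D-Bus) that never answered.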
Stracing sudo and su, so programs that become root: you have to proceed differently, because as I said, if you are not root already when you execute strace, you won't see anything once the program becomes root. So what I do is strace the shell that the user will use to execute the sudo command: I use echo $$ to get its PID, and then I attach strace to $$. I tell the customer to execute his command that fails or is slow, usually with sudo su -, which is a bit redundant, but that's life, and we get our strace. When you want to strace a daemon, for example crond, you can attach to it and wait for cron to execute; that's very easy to manage. A common mistake, similar to the SSH one, is when you want to strace a failing systemd service. I see people all the time stracing the systemctl command, which basically does nothing, just like ssh: it just talks to systemd, and systemd does the work. To get something useful, you have to strace systemd itself, so PID 1. The procedure: you strace systemd, you start the service, you Ctrl-C the strace of systemd once you consider that the service has failed, and you try to extract the interesting part from the big trace, because systemd was doing other things too. What is interesting? You have to find when systemctl was executed, basically, to say "hey, start this service". How it works is very similar to sshd: you see some accept4 syscall on the unix socket, later you see systemd creating a child with clone, and later again you see that child of systemd executing your service with execve. Basically, you will see as many execve calls as there are children: the ExecStartPre commands and the ExecStart command. Then you need to extract; I have better tools, as I said, with grep and so on, and then you can dig into what you want. That was ancient times, when I was doing everything manually.
Now I have scripted everything. Basically, you get the child of systemd you are interested in, and then recursively you extract all the children of your service, in case it spawns and spawns and spawns. You can also easily check with strace if some processes, some services, were killed, or failed with an error or no error, just by grepping, basically. So some people will say: okay, you strace systemd, you get a mess, and then you do filtering, why do that? Well, there is another way, which is to hack the service unit and just replace the ExecStart command with strace plus your command. That's bad, because on RHEL you have SELinux, and because of that, there will be some automatic transitions happening. The strace binary is labeled bin_t, and systemd executes as init_t. So when systemd forks the child and starts executing strace, and not the service binary, strace will run as unconfined_service_t, and because of that, when strace then starts rsyslog, rsyslog will run as unconfined_service_t too, which is not appropriate for rsyslog. Your service probably won't fail, because unconfined_service_t is open bar, let's say, whereas rsyslog, which is supposed to execute in the syslogd_t context, is allowed to do less. So that's why I never hack the unit file, and just rely on stracing systemd and then filtering. Five minutes? Perfect. So, stracing the boot activity. That's very similar to stracing systemd, basically. I do that rarely, but from time to time I want to strace the entire boot activity, when I have no choice. I see from time to time that a service fails to start at boot, but once you restart it, it works fine. Why? It can have many causes. Usually it's because when you boot, you have no network yet, if it's a service that starts early, or no DNS resolution: you can have the network but still no DNS resolution, and things like that.
So the easy thing to do is to strace systemd as soon as it switches root, so that you get everything. Of course, your system will boot slowly, because strace has a huge impact, but you get everything, and then you filter and you're good. There is a small trick to do that. You break into a shell so that you get a prompt at switch-root time, you remount the root file system read-write, because right after switching root you have nothing writable, and then you execute strace with the capital -D flag, so that strace becomes a detached grandchild. Why? Because we want systemd to be PID 1. If we were just doing exec strace systemd, strace would be PID 1, and it seems like that works, but it doesn't, because PID 1 is special: it is used to reap processes that have no parent anymore. So use -D, start it, and you will have your systemd running with all the children, all the services, et cetera. And once you can log in, you have to kill strace forcibly, otherwise it doesn't stop; I didn't check exactly why, I think there must be some signal issue, basically. And the last thing is the SELinux integration. Dmitry already mentioned it yesterday: with a recent strace, you have SELinux integration. That means, and it's very interesting when you want to learn SELinux, that you can see the transitions happening when you execute services. This is an example with rsyslog. When you put --secontext with no argument, you get a small indication: we can see that the child of systemd that will execute rsyslog initially executes in the context of the caller, which is systemd, so init_t. Then it executes rsyslog, whose binary is labeled differently, with syslogd_exec_t, and this results, on the next line, when the execve finishes, in a context change to syslogd_t.
So, if you start stracing with --secontext after hacking the service unit, you will see that you don't get this at all: the service becomes unconfined_service_t, which is bad, basically. This is available on Fedora 36 and later, and RHEL 8.4 and later. And for the ancient guys that don't have it, I have on my public space a rebuilt version of strace, the latest at the moment I did it, with the --secontext integration. It breaks a bit, because some decoding is not aligned with older kernels, but that's of not much interest. And that's it, I'm done. Questions? Looks like you are no beginners. Question: I just want to be sure that the thing you did is upstreamed, right? You provided a link to a separate build... Answer: yeah, I just took the upstream strace and rebuilt it for the --secontext support. It's just a private build that I give to customers sometimes, when I need that on RHEL 7, for example. No question? If you have a question, you can of course talk to me later. Oh, so the question is: I'm writing to a file when I want to strace the boot activity, could we write to the serial console instead? Well, yes, you can, because I think the serial console is already set up in /dev, so you can write to it. But writing to the serial console is never good, because it's super slow and it's synchronous, so writing to a file is nicer. And also the files will be huge, because you have maybe 30 or 40 services running at boot. The suggestion is to create a memory-backed device, mount it, and write there. Yeah, that could be nice. As long as you backport it to RHEL 7, that's fine, I'm interested. Okay, that's it.