Hello friends. I'm Abhijit Singh, and I work as a software engineer at Rupur. Today I'm going to talk about the potential of raw tracepoints in the Linux kernel. I have only 10 minutes, so without further ado, let's begin.

First, what do static tracepoints in the kernel mean? The Linux kernel provides hooks at specific places to call a custom function at runtime. That means we can plug into those hooks and define our own functions, which gives us visibility into what is happening at that particular point in the code. These hooks are the tracepoints. They are defined at important events, and they are predefined rather than dynamic: fixed points in the code where we can attach our custom code and see what that part of the kernel is doing. For example, the sched_switch tracepoint fires when a process context switch happens, and the sched_wakeup tracepoint fires when a process is woken up. At various interesting places in the kernel code these hooks are defined, and when we attach a function to them, we learn what the system is doing. How do we plug into these hooks and instrument what is happening there? We can do it via BPF programs, ftrace, and other infrastructure that already exists in the Linux kernel.

What does a tracepoint look like in the kernel source? We have the TRACE_EVENT macro, which defines a unique tracepoint; here you see sched_switch. Then there is the TP_PROTO macro, which defines the signature that any function plugging into this tracepoint has to obey. So if we write a function that wants to instrument data at this tracepoint, its arguments have to be a boolean, then a pointer to a task_struct, then another pointer to a task_struct; only then is our function valid and able to hook into this tracepoint. Here, prev is the task_struct pointer of the task being switched out, and next is the task being scheduled in.

What if we don't attach our own hook: will the tracepoint still work if we enable it? Yes, it will, and we can still see the trace data. What information gets logged when we enable a tracepoint? By default, the logged information is defined by the TP_STRUCT__entry macro. Here you see a bunch of fields defined in this macro; these are the fields that will be logged to the buffer and that we can see once we enable the tracepoint. Each tracepoint defines its own TP_STRUCT__entry, so each logs its own set of information to the trace buffer, and the format differs from tracepoint to tracepoint. The fields written here obviously depend on the arguments passed to the tracepoint. Notice that there are only seven fields here, while a task_struct has a lot of fields; we are definitely not logging all of those attributes.
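To make this concrete, here is an abbreviated sketch of the sched_switch tracepoint definition from the kernel's include/trace/events/sched.h; the TP_fast_assign and TP_printk parts, which copy the fields and format the output, are omitted:

```c
/* Abbreviated sketch of the sched_switch tracepoint definition
 * (see include/trace/events/sched.h; assignment and print format omitted). */
TRACE_EVENT(sched_switch,

	/* Any handler attached to this tracepoint must accept this signature. */
	TP_PROTO(bool preempt,
		 struct task_struct *prev,
		 struct task_struct *next),

	TP_ARGS(preempt, prev, next),

	/* Fields logged to the ring buffer by default. */
	TP_STRUCT__entry(
		__array(char,	prev_comm,	TASK_COMM_LEN)
		__field(pid_t,	prev_pid)
		__field(int,	prev_prio)
		__field(long,	prev_state)
		__array(char,	next_comm,	TASK_COMM_LEN)
		__field(pid_t,	next_pid)
		__field(int,	next_prio)
	),

	/* TP_fast_assign(...) and TP_printk(...) omitted for brevity. */
);
```

Those seven fields in TP_STRUCT__entry are exactly what shows up per event in the default trace output.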
There's a reason for that: the trace buffer is a ring buffer, and a ring buffer is limited in memory capacity, so we cannot log every attribute of the task_struct. But if we want to do that, or if we want to log a few extra attributes that are interesting to us, we can use raw BPF tracepoints. For instance, a user might want to log the memory descriptor, the number of involuntary context switches, or the runtime/vruntime attributes of the task_struct, and the default macro doesn't log them, so we have to write a custom program to start logging them.

Are tracepoints enabled by default? No; we have to enable a tracepoint by running an echo command, writing 1 to its enable file (for example, /sys/kernel/debug/tracing/events/sched/sched_switch/enable). Once that is done, the tracepoint is enabled and the information starts accumulating in the tracing buffer, which we can read at /sys/kernel/debug/tracing/trace; the kernel logs there even if we do not define a hook of our own. If we do define a hook for a tracepoint, we can produce our own kind of logging, but by default the kernel logs the information described by TP_STRUCT__entry.

This is what the logs look like in the tracing buffer. To the right of sched_switch we have multiple attributes and their values: the name of the previous process, the previous PID, the priority of the previous process, the state of the previous process, then the name, PID, and priority of the next process. This conforms to the information defined in the TP_STRUCT__entry macro.

If we want more attributes or more information to be instrumented, can we do that? Yes. If we write BPF programs of the BPF_PROG_TYPE_RAW_TRACEPOINT type, we get passed all the arguments defined in the tracepoint. In this case such a program gets passed the preempt flag, the previous task_struct pointer, and the next task_struct pointer. Whatever information is reachable through those pointers, we can instrument; all the attributes accessible via these arguments are available to us.

Here I have actually written a BPF program to compute the time slice. What do we define a time slice as? Consider two timestamps: timestamp one is when a process is scheduled to run on the CPU, and timestamp two is when the process is preempted from the CPU. The difference between these two timestamps is the time slice. How do we calculate it? What we define is a function that plugs into the sched_switch tracepoint and gets passed the raw tracepoint arguments. From that arguments context we take the previous task pointer, cast it to a struct task_struct pointer, and now we have a prev variable pointing to the task, so we can access its PID and its utime. What is utime? It is the total time the process has spent running in user mode on the CPU.
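As a rough illustration, here is a minimal libbpf-style sketch of that kind of program, written under my own assumptions rather than being the speaker's exact code: it attaches to the sched_switch raw tracepoint and approximates the time slice as the growth of prev->utime between consecutive switch-outs. The speaker describes a plain BPF array shared across all CPUs and the bpf_trace_printk helper; this sketch uses a hash map keyed by PID and the bpf_printk wrapper instead.

```c
// Sketch of a raw-tracepoint program in the spirit of the one described in
// the talk (not the speaker's exact code). It approximates a task's time
// slice as the growth of prev->utime between consecutive switch-outs.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 8192);
	__type(key, u32);   /* pid */
	__type(value, u64); /* utime recorded at the last switch-out */
} last_utime SEC(".maps");

SEC("raw_tracepoint/sched_switch")
int on_sched_switch(struct bpf_raw_tracepoint_args *ctx)
{
	/* TP_PROTO(bool preempt, struct task_struct *prev, struct task_struct *next) */
	struct task_struct *prev = (struct task_struct *)ctx->args[1];
	u32 pid = BPF_CORE_READ(prev, pid);
	u64 utime = BPF_CORE_READ(prev, utime);
	u64 *prev_utime;

	prev_utime = bpf_map_lookup_elem(&last_utime, &pid);
	if (prev_utime) {
		/* User-mode time accumulated since this task last ran. */
		bpf_printk("pid=%u user-mode slice=%llu", pid, utime - *prev_utime);
	}

	/* Keep the map current so the next context switch computes a fresh delta. */
	bpf_map_update_elem(&last_utime, &pid, &utime, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

Because utime only accrues while the task executes in user mode, the printed delta is the user-mode portion of the slice the task got on the CPU.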
So whenever a process is scheduled on the CPU and later preempted, taking the difference of the utimes gives us the value of the time slice. When is this function called? It is called whenever the sched_switch event happens in the kernel, and that event happens whenever a context switch happens. So any context switch triggers this function, and inside it we read the utime of the previous process. We get that process's previously recorded utime by looking it up from a map: bpf_map_lookup_elem looks up the previous PID and returns the previous utime. The difference between the current utime and the previous utime is the time slice. Once we have computed it, we can print it with the bpf_trace_printk helper. Here you see the string being printed is different, and the information is different too, because we are printing the PID and the time slice. In the middle we also update the map, because we want to keep it current so that we get correct time slices on every context switch. The map is defined in the top section; it is a standard BPF array shared across all CPUs, because we want consistency across CPUs.

Why is this important? If we can write BPF programs that use raw tracepoints, it unlocks a lot of possibilities. It enables deeper debugging and deeper observability into what the kernel is doing, and it is also a bit more efficient than the existing tracing tools in the kernel. If we fully leverage the available attributes and events, we can explore a lot of possibilities.

Let's look at real-world examples of this. BCC is a BPF-based Linux analysis toolkit, and it uses raw tracepoints for several of its tools. runqlat is a tool that computes the run-queue latency of processes in the Linux scheduler, per CPU. What is run-queue latency? You have two timestamps: the first is when a process is enqueued on the scheduler's run queue for a CPU, and the second is when that process actually gets to run on the CPU. The difference between these two timestamps is the run-queue latency, and runqlat instruments it for all the processes in the system. runqslower is a minor variation: it reports only when the run-queue latency exceeds a particular threshold, and when it does, it reports that process and its latency. This is quite interesting because it helps us check whether there is a problem with the scheduler. Say the latencies reported by runqslower are greater than 100 microseconds or 1 millisecond; then there is likely a problem with the scheduler or the scheduling policies, or some other issue in the system. This helps us determine whether the run queue is a cause for concern or not. In this way several other tools can be written, or imagined, that are built on raw tracepoints.
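For a flavor of how a runqlat-style measurement maps onto raw tracepoints, here is a simplified sketch under my own assumptions; it is not BCC's actual implementation, and it ignores re-enqueues of tasks that were preempted while still runnable. The idea is to record a timestamp at sched_wakeup and, at sched_switch, measure how long the incoming task sat on the run queue.

```c
// Simplified runqlat-style sketch (not BCC's implementation): timestamp a
// task at wakeup, then measure how long it waited before sched_switch
// actually put it on the CPU.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 8192);
	__type(key, u32);   /* pid */
	__type(value, u64); /* wakeup timestamp, ns */
} wakeup_ts SEC(".maps");

SEC("raw_tracepoint/sched_wakeup")
int on_wakeup(struct bpf_raw_tracepoint_args *ctx)
{
	/* TP_PROTO(struct task_struct *p) */
	struct task_struct *p = (struct task_struct *)ctx->args[0];
	u32 pid = BPF_CORE_READ(p, pid);
	u64 ts = bpf_ktime_get_ns();

	/* Remember when this task became runnable. */
	bpf_map_update_elem(&wakeup_ts, &pid, &ts, BPF_ANY);
	return 0;
}

SEC("raw_tracepoint/sched_switch")
int on_switch(struct bpf_raw_tracepoint_args *ctx)
{
	/* TP_PROTO(bool preempt, struct task_struct *prev, struct task_struct *next) */
	struct task_struct *next = (struct task_struct *)ctx->args[2];
	u32 pid = BPF_CORE_READ(next, pid);
	u64 *ts = bpf_map_lookup_elem(&wakeup_ts, &pid);

	if (ts) {
		/* Run-queue latency: from wakeup to actually running. */
		bpf_printk("pid=%u runq latency=%llu ns", pid, bpf_ktime_get_ns() - *ts);
		bpf_map_delete_elem(&wakeup_ts, &pid);
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A runqslower-style variant would simply compare the computed latency against a threshold before printing.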
Tools like runqlat and runqslower are examples of that. So that's it. All in all, what I want to end with is that using BPF and raw tracepoints, we can enable deeper observability into the system. We can instrument what the Linux kernel is doing with a lot of information that is not available via the default instrumentation the kernel provides. If we plug into those hooks with our own BPF programs, we can get much deeper observability and do a lot of interesting analysis of the events happening in the system. Thank you, and thank you for your questions.