Here we go. I'm Daniel. I work for Red Hat as a software engineer on the real-time team, but I also love working with academia, and I actually work from the university in Pisa, where I live now. I am trying to connect real time, the Linux kernel, and now runtime verification. Red Hat believed in my ideas, so now I am working on this; let's see what you all think.

So, Linux is critical. Many people say that in the future we may use Linux in cars, and as a kernel developer I know that the kernel crashes, and cars plus crashes are not a good thing, right? Another point is that Linux is complex. Later we will see that a model of just a part of Linux can reach a huge number of states, and it is hard to explain those things in words, because the statements can be contradictory and they do not translate easily. Because Linux is critical and complex, we need to be sure that Linux behaves as expected.

But what do we expect from Linux? We have a lot of documentation explaining what we should expect from the deadline scheduler, for example, or from PREEMPT_RT. We have it in many different natural languages; I learned Linux in Portuguese, for example. We have a lot of assertions in the code that say: if this happens, generate a WARN_ON (and nobody should use BUG_ON anymore; we learned that). And we have a lot of test cases, fuzzing, and all that stuff to check whether the system behaves. All of these things are good, but we need something more robust. How do we check that our reasoning, and what we write down, is correct? How do we check that our asserts are not contradictory? For example, in the real-time community we say that we do not call the scheduler with preemption disabled; that is common knowledge. But actually, the scheduler is always called with preemption disabled. These corner cases of understanding are hard to assert in natural language. How do we check that our tests cover all the cases, and that we are expressing all of the behavior of Linux? And how can we verify all these ideas?

The main point, when we go to safety-critical systems, or when we talk to the mathematicians in the real-time community, is: how do we convince people of our properties? It is easy to convince myself, Daniel, that Linux is a good thing, and it would be very easy with this audience; I would not need to explain much. But when you want to use Linux in cars, you need to convince certification authorities, and they would like something more than just a Linux guy's explanation. And what do people in computer science say when we need to explain complex behaviors? Formal methods. That is the first thing that comes to mind. And we have some successful examples of formal methods applied to the Linux kernel: the memory model, which has a lot of examples and has caught real problems; lockdep, which is somewhat formal; and working groups using formal languages, for example on the Arm side, where Catalin Marinas wrote a spinlock implementation and used temporal logic to find bugs in it. So we have some good examples of formal methods applied to the Linux kernel.
But we need a more generic and intuitive way of modeling other parts of the kernel, or broader behavior: not just the behavior of a single subsystem, like memory or locking, but a way to connect many subsystems at a higher level, at the design level. So how can we make this easier? The answer is to use a formal language that looks natural to us. And how do we naturally observe the dynamics of Linux? We trace. We have a lot of tools for tracing: ftrace, perf, and now BPF, and we can do a lot of magic with ftrace and the tracing subsystems. And while tracing, inside our minds we build some sort of state machine. That is natural for operating system developers, because we have all read books showing the states of a task in an operating system, right? So this seems to be an intuitive way to go. I just copied and pasted this diagram from GeeksforGeeks; yes, that is a very computer-science thing to do.

So, when we think about state machines and formal methods, one possible solution is the use of automata. We can use automata for many things, and in particular to specify event-driven systems. In an event-driven system, we see the evolution of the system as a sequence of events, and that sequence forms the language that the system speaks. We can also use automata for run-time analysis. Here is a simple example. Let's say we want to specify how a bogus network client works. We can say that we can open and close a connection, and that we can write a request and read the response. This is a very simple case, and it is easy to read in this graphical format. But the good thing about a formal method is that we get a mathematical description of what we specify. I will not go into the details here, but in the end an automaton is a finite set of states, a finite set of events, and a transition function saying: in a given state, given an event, what is my next state? So everything is deterministically specified. Automata also allow some implicit verification: I can check, for example, that the reasoning I am expressing is free of deadlocks and livelocks. And they allow modeling and developing the system using mathematical operations. We will try an exercise with this.

So, back to the previous example. We can model it as a set of separate modules. One module is open and close: I can open, and then my socket is open; I can close, and then it is closed. The other module is: I am ready to write, then I write the request, wait for the response, and read it. Pretty straightforward. Then I can synchronize these two small models and create a big model that represents all the chains of events that are possible in the unsynchronized version of the system. Here we have a set of events that we expect, like open, close, read, and write, and sequences that we do not expect: it is not correct to write before opening, and it is not expected to close and open while we are in the middle of a request. This is just an example, right? So then I take the events of two or more generators and put them inside a small model, a specification, that synchronizes those events into what we expect. So I say, for example, that I can only close while I am not in the middle of an operation, and that I can only read and write after opening the socket. Makes sense? Simple.
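To make this concrete, here is a minimal, compilable C sketch of the synchronized client automaton. This is not the Supremica model itself, and the state and event names are just illustrative; the point is the shape of the thing: a finite set of states, a finite set of events, and a transition function stored as a small matrix.

```c
#include <stdio.h>

enum states { CLOSED, OPENED, WAITING_RESPONSE, STATE_MAX };
enum events { EV_OPEN, EV_CLOSE, EV_WRITE_REQ, EV_READ_RESP, EVENT_MAX };

#define INVALID_STATE (-1)

/* transition function as a matrix: next = transition[current_state][event] */
static const signed char transition[STATE_MAX][EVENT_MAX] = {
	/*                      open           close          write_req         read_resp     */
	[CLOSED]           = { OPENED,        INVALID_STATE, INVALID_STATE,    INVALID_STATE },
	[OPENED]           = { INVALID_STATE, CLOSED,        WAITING_RESPONSE, INVALID_STATE },
	[WAITING_RESPONSE] = { INVALID_STATE, INVALID_STATE, INVALID_STATE,    OPENED        },
};

int main(void)
{
	/* a valid sequence of events: open, write a request, read the response, close */
	enum events trace[] = { EV_OPEN, EV_WRITE_REQ, EV_READ_RESP, EV_CLOSE };
	int state = CLOSED;	/* initial state */
	int i;

	for (i = 0; i < 4; i++) {
		int next = transition[state][trace[i]];

		if (next == INVALID_STATE) {
			printf("event %d not expected in state %d\n", trace[i], state);
			return 1;
		}
		state = next;
	}
	printf("sequence accepted, final state: %d\n", state);
	return 0;
}
```

An unexpected sequence, such as writing before opening, simply hits an invalid entry in the matrix.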
While modeling, there are some tools that help; here I am using Supremica, which is an academic tool, but it lets me do everything I need. And it says: okay, the system is blocking. This is an implicit verification that gives me evidence that the system has either a livelock or a deadlock. So I take the model created from the synchronization of our events, and I see that there is something strange: I can open it, but I cannot return to the initial state. There is something wrong with my specifications. What was wrong is that I could read right after opening, but not after I closed and reopened it. With this all synchronized, we get back the system I showed at the beginning, right?

So why not just draw the whole thing directly, Daniel? That would be easy. Well, because Linux is complex, right? Linux is way more complex than three states. For example, I modeled the behavior of PREEMPT_RT, just for the single-core case. The level of abstraction I used was not the code; it was the synchronization level, including IRQ disabling and enabling, preemption control, locking and nested locking, scheduling, and so on, the level at which we as real-time developers work. I ended up with a model with 9,000 states and more than 23,000 transitions. I could draw this by hand, but not within the time of my PhD, and I think I would need two or three pens. So drawing the model by hand is not sufficient when you have a complex system like Linux. Using this modular approach, I could break the problem down into 12 generators, small and independent subsystem models, and 23 specifications, which in the end are the properties I wanted to express. And during the development we actually found three bugs that could not be detected by any other tool available on Linux.

So let's have a look at that case. I can set need_resched when there is a new higher-priority thread. I have a task in a sleepable state; if I wake it up, or it sets its own state to runnable, it goes to runnable and can return to the sleepable state. I can call the scheduler and return from the scheduler. I can disable preemption to avoid calling the scheduler, or disable preemption in order to actually call the scheduler. I can disable IRQs to avoid interrupts, or IRQs stay disabled while an IRQ handler runs, because IRQs are not preemptible on Linux. And a task can switch in and out. These were the generators. Now, by joining all those previous automata, I get a view of all the possible transitions, and I can be sure that I am not forgetting any.

Then I start to state the properties that I expect from Linux. For example: the scheduler will never be called with IRQs disabled; the scheduler will necessarily be called with preemption disabled, specifically the preemption-disabled section used to call the scheduler; the context switch will always run inside the scheduler; setting need_resched and the sched_waking event will always take place with IRQs and preemption disabled; and these kinds of necessary conditions for the context switch. I know it is boring. I did it on weekends, because at the very beginning it took me a while to convince Red Hat that I could work on this full time, and it was summer in Tuscany. But I liked it, I have to admit; it was a nice break.
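To give an idea of what one of those specifications looks like on its own, here is an illustrative sketch, with hypothetical state and event names rather than one of the actual 23 specifications, of a necessary condition written as a tiny automaton: the context switch only happens with preemption disabled.

```c
#include <stdio.h>

/* two states: preemption enabled (preemptive) or disabled (non-preemptive) */
enum prop_states { PREEMPTIVE, NON_PREEMPTIVE };
/* the events that are relevant for this property */
enum prop_events { PREEMPT_DISABLE, PREEMPT_ENABLE, SCHED_SWITCH };

#define BLOCKED (-1)	/* the event is not allowed in this state */

/* "a context switch only happens with preemption disabled" */
static const signed char property[2][3] = {
	/*                    preempt_disable  preempt_enable  sched_switch   */
	[PREEMPTIVE]     = { NON_PREEMPTIVE,  BLOCKED,        BLOCKED        },
	[NON_PREEMPTIVE] = { BLOCKED,         PREEMPTIVE,     NON_PREEMPTIVE },
};

/* returns the next state, or BLOCKED if the property is violated */
static int property_step(int state, int event)
{
	return property[state][event];
}

int main(void)
{
	/* a sched_switch while preemptive violates the property */
	int next = property_step(PREEMPTIVE, SCHED_SWITCH);

	printf("%s\n", next == BLOCKED ? "property violated" : "ok");
	return 0;
}
```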
It was nice to show that we have a deterministic set of events that makes PREEMPT_RT deterministic. That was the challenge I was looking for. And here is one good example. Before, I was talking about necessary conditions: the conditions under which an event can or cannot happen. But it is also possible to express sufficient conditions, which say that after one thing happens, I am sure that some other event will eventually happen. For example, after need_resched is set, I will only return to the initial state after a context switch. That is, I am forcing the scheduler, and I am blocking all the sleepable operations until then; I am just trying to find the fastest way to call the scheduler while blocking other kinds of code. Here I can abstract away the sleepable code on Linux, the code that runs outside of a preemption-disabled section. So these are the possible paths, and I always return to the initial state after a context switch. This is a formal way to say that there is a deterministic set of events, and sequences of events, that take me from setting need_resched to the scheduling of the highest-priority thread on Linux. And that is a good property to show in the real-time realm: PREEMPT_RT is deterministic, in some sense. Talking to other communities, this modeling approach and the model itself were accepted; there is one journal paper under review now. So that is how to model. That was the first part.

But how do we use the model? How can we check the model against Linux, and Linux against the model, to get something meaningful for us as kernel users or developers? In the first part of the project, I used something called offline runtime verification. On one hand, I start modeling from my knowledge of how Linux and the kernel work; I am a developer, I somehow know how it works. On the other hand, I trace the system. I collected traces of Linux, and at that time I used perf and ran the automaton step by step, fed by the trace from the kernel, trying to validate the model. At the beginning, I found a lot of problems in my model, because my reasoning was incomplete or wrong. But at some point, I started seeing problems that were not in the model; they were problems in the kernel. That was the point at which we found this useful for runtime verification, because before that I was just trying to explain the behavior of PREEMPT_RT. So it was good.

The automaton processing itself is done in good time, as we will see later. But as I was tracing the system, and the granularity of the events is so fine, I reached cases where, on a single core, 30 seconds of tracing generated 2.5 GB of data. That is a massive amount of data, and I was not even tracing all the functions; I was just tracing the synchronization events. So what could we do to make it better? I turned it from an offline kind of runtime verification into an online, synchronous runtime verification. These are the steps, and I will explain them one by one. First, we have the big model, and I have to translate that model into something we can run on a computer. So I wrote a Python script that just translates the formal definition of the model into C code. Yay.
And here is one example of the code translation. I have a simple model in which I say that, while preemptive, that is, with preemption enabled, the kernel will not wake up a task; in other words, the necessary condition for a wakeup is that preemption is disabled. I just run the script and voilà: the set of states is translated into an enum, the set of events into another enum, and I use the classical definition of an automaton, filling the C definition with the information from the real automaton. It turns out to be very simple data structures, and we will see why that is good in a moment, I hope.

So that was generating the code. Now we need to link that auto-generated code with the kernel. Here is the main function of the runtime verification: any time I receive an event from the kernel, like a preempt_disable, I call this function telling it which event I am receiving. (This part here is just the tracing information, where I store data.) I get the current state; given the current state and the event I am receiving, I get the next state I am expected to go to. If it is a possible state, I go to the next state, and I might print some debug information or not. But if it is not expected, I print the debug information and, for example, do a stack trace. I could take any action; the stack trace is just an example. Now, going into the functions: getting the current state is a variable read, setting the current state is a variable write, and getting the next state is a matrix lookup. Simple things. Getting the names of the events and of the states are just vector lookups. So all the operations are O(1). This means that the number of states does not matter; what matters is the frequency at which I call this verification code. And I use just one variable to keep the state. These are good properties.

In the end, I load the model, and while the system runs, if I do not find any exception, that is, if no unrecognized event happens, I can print into the trace buffer or just ignore it, because things are working fine. And if some unexpected event happens, I can print some information. Here is one example: I caught a case in which, after returning to the preemptive (initial) state, I got a sched_waking event, which was not expected. So I say: hey, there is a problem here. And this problem was a bug in the tracing subsystem, a bug that no other tool could catch. I forgot to open the link, but here is the link and here is the information.

Okay, nice. But we know there is no free lunch, right? It is nice to say it runs in O(1), blah, blah, blah, but I am using a non-compact structure: a plain matrix. If I used a more compact, linked data structure, the lookups would be O(n) in the number of events that can happen in a given state, and the overhead of the links and other bookkeeping would be too high compared to the single variable I want to store. And even the PREEMPT_RT model, which is, let's say, a big model, uses only eight kilobytes of data. That is not much; it is affordable for today's computers to spend that on verification. So in the end, it is acceptable.
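For reference, here is a rough, self-contained sketch of what such an auto-generated monitor can look like for the small "no wakeup while preemptive" model; the names and the printf() reporting are illustrative rather than the exact generated code, but the structure is the point: two enums, a transition matrix, a single state variable, and an O(1) process_event().

```c
#include <stdio.h>

/* states and events of the "no wakeup while preemptive" model */
enum states_wip { preemptive, non_preemptive, state_max_wip };
enum events_wip { preempt_disable, preempt_enable, sched_waking, event_max_wip };

#define INVALID_STATE (-1)

/* the classical definition of a deterministic automaton, filled from the model */
struct automaton_wip {
	const char *state_names[state_max_wip];
	const char *event_names[event_max_wip];
	signed char function[state_max_wip][event_max_wip];
	signed char initial_state;
};

static const struct automaton_wip aut = {
	.state_names = { "preemptive", "non_preemptive" },
	.event_names = { "preempt_disable", "preempt_enable", "sched_waking" },
	.function = {
		/*                    preempt_disable  preempt_enable  sched_waking   */
		[preemptive]     = { non_preemptive,  INVALID_STATE,  INVALID_STATE  },
		[non_preemptive] = { INVALID_STATE,   preemptive,     non_preemptive },
	},
	.initial_state = preemptive,
};

/* a single variable keeps the current state: every operation below is O(1) */
static signed char current_state = preemptive;

static void process_event(enum events_wip event)
{
	signed char next = aut.function[current_state][event];	/* matrix lookup */

	if (next != INVALID_STATE) {
		current_state = next;	/* variable write */
		return;
	}

	/* unexpected event: report it (in the kernel this could be a stack trace) */
	printf("event %s not expected in state %s\n",
	       aut.event_names[event], aut.state_names[current_state]);
}

int main(void)
{
	/* a valid sequence: disable preemption, wake a task, enable preemption */
	process_event(preempt_disable);
	process_event(sched_waking);
	process_event(preempt_enable);

	/* a wakeup while preemptive violates the model and triggers the report */
	process_event(sched_waking);
	return 0;
}
```

In the kernel, a function like this would be hooked to the corresponding trace events, and the report would go to the trace buffer or a stack trace instead of printf().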
And in practice, we do not need such a big model as the one I built. We can check just some properties of the system that we want to observe. It is better to have a good, big model, because you get more verification for the same cost, but I can check just a few properties if I want to. For example, while testing the latency of PREEMPT_RT, I can also try to catch cases in which the code could potentially go to sleep in an atomic context, that is, when it cannot sleep because IRQs or preemption are disabled. So I can try to catch logical problems in PREEMPT_RT while running the performance measurement. We can do both things at once, if the verification performs well enough, right?

And that is what we will see now: how efficient is this idea? It is nice to say it is O(1), my boy, but in practice we need to see things the way we like them, with benchmarks, right? So I ran two kinds of benchmarks. I used the Phoronix Test Suite, which I like; it is good for academic papers because it formats the output as SVG, so we can do a lot of transformations on the graphs, and it is very nice. I ran tests both for throughput and for latency, which is the domain I like more. The baseline in the comparison, the kernel as-is, is the kernel without any tracing or verification, just the kernel running. The trace case is tracing the events that I am verifying, but just tracing them: not processing, just saving into the trace buffer. And the verification case uses the big model on PREEMPT_RT.

When the workload is not calling into the kernel much, as expected, there is not much interference; sometimes the difference is even within the error margin. So not exercising the kernel does not show an effect; okay, that is expected, and it is good that things are as expected. But when we start doing heavy activation of the kernel, we see some performance degradation. That is normal, right? There is more code running there. The good thing is that the verification always sits between the baseline kernel and the tracing. Also, when running the latency tests with cyclictest, this is the line of the PREEMPT_RT kernel running, this is the line with verification, and this is the line with tracing. So verification causes less overhead than tracing, because we are not storing data in the trace buffers, and the number of operations is bounded and smaller than for tracing.

I do not want to say that tracing is bad, because I am a friend of Steven and I like him. What I want to say is that this approach is efficient enough to be used in production: we can trace in production, right? And since we can trace in production, we can do verification as well, because we do not cause more damage than tracing does. And actually, I am using the trace infrastructure to hook the events into the kernel while it is running, so I am a trace user. This idea was also academically accepted; I talked to that community and they liked it. And there are more things to come.

So: it is possible to model complex behavior of Linux using a formal language, creating big models from small ones. It is possible to verify properties of the model. And since automata are an executable formalism, we can also run other kinds of formalism, like temporal logic, on top of the automata rather than on the code. And it is possible to verify the runtime behavior of Linux. So now I am working with Steven and Arnaldo to make a generic subsystem, I do not know exactly how to call it, or a generic way to verify the system using automata.
It will work both with perf, because there are cases in which it is better to use offline runtime verification, and with ftrace, for the cases in which it is better to do online runtime verification. Also, I have all of this described in papers, but papers are boring for this community; we prefer to read things on LWN, and I like LWN too. So we have been talking about translating all these ideas into the Linux kernel's natural language. And there are other things that we could model in the system, like parts of lockdep, for example. One reason to at least try is that, since I am using dynamic tracing, I can plug and unplug all of this while the system is running; I do not need to recompile if I already have the events I need. So it is handy. I am also talking to Joel Fernandes; we are trying to do something with RCU, to show that this idea is not applicable to only one case, because it would not be useful if it were just for my case.

It is worth mentioning that there are many good things about runtime verification of Linux using automata, but I do not want to say that this is the only way or the best way. It is not the best way; there are a lot of ways to model Linux. We can model at the code level, explaining for example the behavior of spinlocks using Promela and then using temporal logic to verify it; that is one kind of formal verification we can run. We also have runtime and static code analysis, with which people find and fix a lot of bugs. The approach I am working on here is for runtime behavior, at more of a design level; it can also be used at the code level, but it is more of a design-level thing. And there is space for a lot of research on formal methods in the Linux kernel. There are many good people working on many good projects: there is the project that verifies drivers for Linux, at linuxtesting.org, I think, and there is Coccinelle, which Julia also works on. So there are many good people working on this, and we will see more and more application of formal methods to Linux. The more we see, the better; the more abstractions and the more ways to verify it, the better, because we do not want a car to crash because of a Linux crash. Anything else? No. That's it. It went fine. So thank you for not sleeping.