Thank you for the introduction. I'm Marco, and I'm part of the kernel dynamic tools team at Google, where I've worked on various sanitizers over the past few years. One of them is the Kernel Concurrency Sanitizer (KCSAN), which we upstreamed in 2020. I've also worked on Kernel Electric-Fence (KFENCE) in the Linux kernel, and various other contributions over the years. So let's get started.

To give you an idea of what we'll be talking about: first I'll introduce background on data races, to give you an idea of what we actually mean by data races. I'll also introduce the notion of a memory consistency model, in particular the Linux kernel memory consistency model. Then I'll introduce data race detection techniques and KCSAN, and finally discuss a few ways to detect concurrency bugs beyond data races.

So what is the problem? The problem here really is that reasoning about multiple threads of execution is notoriously difficult, and there's usually a tension between performant versus simpler synchronization mechanisms, especially in the kernel, which has numerous advanced synchronization mechanisms. We also know the kernel's job is inherently concurrent, so concurrency is everywhere. We need good tooling to do our job well and write reliable code.

With that, what are data races? Data races came about because the C language and compilers evolved oblivious to concurrency, and there are numerous optimizations that compilers can apply to plain C language accesses that would ultimately break concurrent code. That is, there'd be a problem if the programmer requires and believes that a plain C language access actually happens as one indivisible, atomic step. The solution to this problem is to make the compiler concurrency-aware. But how do we still keep all the nice compiler optimizations that, in the common case, are useful and make our code run faster?
The best solution we have is to tell the compiler about accesses that will be used concurrently. If you're interested, there are lots of articles on this, and one is linked here in the slide.

Data races are defined at the programming-language level, in the sub-specification of the language called the memory consistency model. Since C11, C and its compilers are no longer oblivious to concurrency: the standard introduced a memory model along with atomics and so on. In C11's world, data races cause undefined behavior, simply because it's too difficult to enumerate all possible compiler and architecture combinations and how various optimizations might behave. It's really an impossible task to give well-defined meaning to data races. But that's not exactly the Linux kernel's model. The Linux kernel has its own memory model, the Linux kernel memory consistency model, for various reasons; in short, it's not straightforward to match the kernel's needs with what C11 would have us do.

So what are data races? I will go into more detail later, but at the very basic level, a data race occurs if we have concurrent conflicting accesses. They conflict if they access the same location and at least one is a write, and at least one is a plain access: a plain C language access, not specially marked for synchronization.

Data-race-free code has several benefits. It is well-defined, which means you avoid having to reason about the compiler and the architecture: is this data race benign? What might the compiler do here? What might the architecture do here, depending on how the compiler lowers the code to assembly? That is not where we want to be. Ultimately, we will have fewer bugs. Data races can also indicate higher-level race-condition bugs, for example failing to synchronize accesses using locks, and ultimately we want to prevent bugs.
So data races are very good signals telling us to investigate a piece of code, and thereby avoid countless hours debugging various tricky race conditions.

I want to introduce a motivating example of what the compiler might do that will break concurrent code. There's a common compiler optimization called load fusing. Here we have a function which performs two guarded accesses: the first if statement loads x and then performs an access, and again, in the unoptimized code, it loads x and then does another access. A common compiler optimization here is to simply perform the load of x once; this is called load fusing.

Now suppose we had a piece of code that waits for a particular variable to become non-zero: we spin in a loop and keep reading stop, at least in the unoptimized version. That's what we think it would do. But compilers are clever and, in the absence of being told about concurrency, are free to optimize this piece of code into the code on the right side: a single load of stop, then an infinite loop. And that's not what we want. If the programmer had intended for there to be a potential concurrent write to the stop variable — in the Linux kernel, for example, we would use WRITE_ONCE() to concurrently write to the variable referenced by stop — then this code breaks. The solution to this in the Linux kernel is to use marked accesses such as READ_ONCE(); I will go into more detail later.

Again, data races can be symptoms of higher-level logic issues. For example, one of the first issues found with KCSAN was in the FAT filesystem in the Linux kernel. If we look at this piece of code, at the particular accesses that KCSAN pointed out to us — this concurrent write and read — we would never guess what the fix might be.
So the fix here is only obvious, perhaps, to a maintainer who deeply understands the code, and this data race was merely a symptom of the higher-level issue.

Before I proceed to explain more about data race detection, I want to explain what a memory consistency model is, and in particular the Linux kernel memory consistency model. A memory consistency model ultimately is a specification of how loads and stores are observed in the presence of concurrency. The simple question we want to answer is: what value does a read access observe? To write correct concurrent code, the programmer ultimately needs to understand the semantics of the system they are programming, and the memory consistency model specifies precisely the ordering guarantees of memory operations with which programmers can reason about their parallel programs.

Memory consistency models exist at different levels of our stack. At the hardware level, some of you may be familiar with the x86 memory model, x86-TSO; ARM and its different variants are out there as well, and PowerPC and Alpha are further examples of architecture-level, hardware-level memory consistency models. These are the system-centric models. But it is important to note that when writing portable software, we should not be thinking in terms of the system-centric model, but rather the programmer-centric model: at the programming-language level, we have our own memory models. Since C11 and C++11, we have memory models in C and C++ — unfortunately not used by the Linux kernel — and Java has its own memory model, for example. Various other languages are also beginning to introduce more rigorous memory models. Language-level memory models ultimately are about telling the compiler where to expect concurrent code, so that the compiler can rein in those optimizations that would break concurrent code.
So we have to distinguish marked accesses from plain accesses. Often marked accesses are atomic; in C11 and C++11 we have std::atomic in C++ and _Atomic in C, for example. These are special accesses that tell the compiler to emit a particular atomic access and not to perform other optimizations around that location, for example reordering operations in a way that would break concurrency. Then we have our normal plain C language accesses, which the compiler is free to optimize in all kinds of ways. Marked accesses provide various ordering guarantees and are also the building blocks for higher-level synchronization; they're really the lowest-level primitives we have to know about to write concurrent code. The compiler is not allowed to transform code in ways that would weaken the memory model, and this is an important point. Before C11 and C++11, this was an issue, and it has of course also been an issue for the Linux kernel.

There's a trade-off to be made between more performant memory models and simpler memory models. The strictest and simplest memory model is called sequential consistency, but it doesn't allow for as many optimization opportunities in the compiler or in the CPU. A weaker memory model allows for greater opportunities for speculation, which usually translates into greater performance.

Here I have a figure, which I created many years ago, listing various memory models. At the bottom of the figure, you see the system-centric models that you'll find in CPUs and even GPUs, and at the top are the programmer-centric memory models that we find in our programming languages. I've placed the LKMM deliberately in between, on the axis of PowerPC and ARM, which is considered weaker than the C11 and C++11 memory models. That also makes it a more complicated memory model.
But we'll get to this in a few slides. The Linux kernel memory consistency model, or LKMM for short, resulted from many years of experience in the Linux kernel, targeting a vast set of architectures and producing code that is as efficient as possible. The Linux kernel's requirements changed over the years, and as it evolved, this ultimately resulted in a non-standard memory consistency model, for better or worse. It means the Linux kernel is in full control of the precise ordering rules, which also allows for greater optimizations in some cases. But it has evolved and changed over the years: the memory model the Linux kernel used ten years ago is no longer the model it uses today.

There is also a formal memory model, if you're interested: people have formalized the Linux kernel's memory model, and that has greatly helped in pinpointing the various rules of the model and refining them. But real code currently uses slightly different, and sometimes more relaxed, rules. There is informal documentation as well; you may have heard of memory-barriers.txt. Unfortunately, this documentation is not complete either. It even says: this document is not a specification; it is incomplete, intentionally for the sake of brevity and unintentionally due to being human. If you're interested in more discussion on this topic, there is a very good paper from 2018 that you can look up.

The basic marked accesses of the Linux kernel memory model and their rules are listed in this table. The most basic one is READ_ONCE(x): its result is simply the value of x, and it orders later dependent reads and marked writes. WRITE_ONCE(x, y) is simply a write of y to x, and it orders nothing. smp_load_acquire(&x) returns the value of x, and it orders later reads and writes. Then we have smp_store_release().
smp_store_release(&x, y) is a write of y to the location of x, and it orders earlier reads and writes. Then there is the marked operation rcu_dereference(), which is similar to READ_ONCE() but should be used in RCU context; it usually returns a pointer loaded from x, and it orders later dependent reads and marked writes. smp_mb() has no result; it's a full barrier, which orders earlier and later reads and writes. smp_rmb(), also with no result, orders earlier and later reads. smp_wmb(), again with no result, orders earlier and later writes.

As I just mentioned, some of these operations order dependent accesses: READ_ONCE() and rcu_dereference() are special here. This is one important aspect of the Linux kernel memory model where it differs from other language-level memory models, in particular C and C++: the Linux kernel really wants dependency ordering. In other memory models, you may have heard of consume ordering, which has caused its own problems; due to being very difficult to implement, it's not actually implemented in compilers. But the Linux kernel still wants dependency ordering and does its best to guarantee it. So READ_ONCE() and rcu_dereference() order later address-, data-, and control-dependent marked writes, and address-dependent reads.

As an example: here we have a READ_ONCE() which loads a pointer, foo, into the variable x, which is itself a pointer, and then dereferences that again with a READ_ONCE(). This is called an address dependency, and the LKMM tries to guarantee ordering of these two accesses. Similarly for data dependencies and also control dependencies, though control dependencies only order later writes. Here we have a READ_ONCE() that loads some variable into x, then x is used in a condition, and if the condition is true, we perform a WRITE_ONCE(). This READ_ONCE() and WRITE_ONCE() are supposed to be ordered through a control dependency.
And here is an example that is not ordered through a control dependency: we have a READ_ONCE(), and the value loaded into x is used in a condition, but the read that is executed if the condition is true would not be ordered, because control dependencies only order later writes.

A warning here: this is really one of the most tricky aspects of the Linux kernel memory model and is likely to change in the future, because compilers can still break dependencies. At last year's Linux Plumbers Conference, it was shown that this is happening in the wild, unfortunately. There are people working on solving this, and there is a chance that this aspect of the kernel memory model may change in the future as well.

There are many more marked accesses, for example all atomic_t accesses. The Linux kernel has this type called atomic_t, which together with various accessors provides atomic accesses: atomic increment, decrement, compare-and-exchange, and so on, plus various other atomic read-modify-write operations. The atomic bitops found in the kernel are also marked accesses.

So now we get to the part where I can tell you more about what data races are in the Linux kernel memory model's world. As I mentioned earlier, data races occur if we have concurrent conflicting accesses: they conflict if they access the same location, at least one is a write, and at least one is a plain, unmarked access.

Here we have several examples. The first one is obviously a data race: there is a plain read of x and a concurrent plain write. In the second example, we have a WRITE_ONCE(); although this is a marked operation, the read is unmarked, so this is a data race, too. In the third example, we have a READ_ONCE(), a marked operation, but an unmarked write, which would also be a data race — and there is a caveat about the third example I want to point out.
Even though such writes should be marked, there is still a lot of concurrent code in the Linux kernel where the writes are not. This is one thing that, with the help of KCSAN, we're trying to improve, but there are still some cases — perhaps due to legacy preferences of maintainers in particular subsystems — where not all writes are marked. This is just a caveat I want to point out.

The fourth example is also a data race: we have a marked read, but an unmarked read-modify-write operation, just an increment of x, and this is a data race as well. Then here, we have two unmarked writes; this would also be a data race. The last two examples are not data races, because all the operations involved are marked. In the first of these, we have a READ_ONCE() and a WRITE_ONCE(); both are marked accesses, therefore this is not a data race. In the last example, we have two WRITE_ONCE() calls; both accesses are marked, therefore not a data race.

Intentional data races also still exist in the Linux kernel. The Linux kernel's position is that data races do not result in undefined behavior of the whole kernel, but rather are locally undefined: where code can still operate correctly even with potentially random data, data races are tolerated. For example, statistics counting is one place where you may still find a lot of intentional data races in the Linux kernel. Through the upstreaming of KCSAN, we also introduced a way of explicitly marking such cases with the data_race() expression. It is useful for documenting the intent that the data race is intentional — it's not a bug — and it also helps tools such as KCSAN understand that it is intentional. There's a very helpful document in the Linux kernel called access-marking.txt, which has a lot of this guidance. I think here I want to briefly pause, maybe for some questions.
Marco, there is a question in the Q&A box. Would you like me to read it to you, or can you see it on the screen?

Right, I think I can see it. Okay, so there's a question: can you shed light on what exactly memory ordering is, with a few examples? This is a good question. So, memory ordering — let me just go back to the dependency ordering example, even though memory ordering is not restricted to dependency ordering in any way. If you're writing concurrent code, you usually want accesses to appear in a certain order with respect to other concurrent observers of the data that you're writing. Here, for example, we have two READ_ONCE() calls, which are ordered. One scenario would be that the address pointed to by foo — that location — is written by some other thread, and the other thread is using a release operation. I mentioned release earlier: what does smp_store_release() do? It orders all earlier reads and writes.

So what might happen is that the other thread, before doing this smp_store_release(), was performing updates to some data, writing a lot of other data, and then finally it wants to publish this data by updating a pointer. In this first example, with the address dependency, it publishes the data by writing the pointer — in this example, foo — and the reader of this data wants to observe the changes that happened before the data was published; it wants to observe all of those changes. What can happen on weakly ordered architectures, and also through compiler optimizations, is that without telling the CPU or the compiler that you want to order earlier writes, the CPU or the compiler is free to reorder operations around other writes.
And if, for example, you're not using an smp_store_release() to publish data updates, the CPU or compiler can reorder the updates to after the publishing store, which means a concurrent reader is not guaranteed to see the updates that happened before you signaled that the new data is ready. Does that roughly answer the question? I guess we'll call it answered. Okay, please come back if you have a follow-up question.

There is another question in the chat; I posted it for everybody.

Yes, I see it. I'll just read it: the Rust programming language seems to address part of the memory management problems described here; wouldn't it be useful to use it in the Linux kernel at a deeper level? So, at this very low level, every language has a memory model; even Rust has a memory consistency model. In fact, as far as I'm aware, Rust is adopting a slightly simplified version of the C++11 memory model. This is not about locking per se: a memory model gives you the rules with which you can implement locks and other synchronization primitives, and Rust, of course, also has to implement these primitives somewhere. The memory consistency model gives you the rules to implement higher-level synchronization primitives. And there are also cases where someone might want to do lock-free programming in Rust; in that case, you also want to know the rules with which you can reason about your concurrent code. I'm guessing that in those cases it might be common to have to drop to unsafe Rust. But in safe Rust, as far as I'm aware, you're also using synchronization primitives like spinlocks or mutexes, and the rules with which they are implemented follow the memory consistency model. I hope that answers the question.

So what you're saying is there is no magic wand; you have to do the hard work of coming up with a memory model, adopting it, and implementing it.

Correct.
And every language — I want to stress this — every programming language that has concurrency has a memory model. Sometimes you're exposed to it more than in other languages, and some languages' memory models are simpler than others. But fundamentally, if there is concurrency, there's a memory model, either informal or formal, or something in between. Without a memory model, the architectures, the compilers, and the programmers could never agree; this is why memory models are so important. Rust definitely has a memory model: you program against the language's model, and the compiler lowers it in a way that is compatible with the architecture, being careful that the guarantees made at the language level are preserved at the system-centric level shown in the figure earlier. If you're lowering from a language like C or C++ to an architecture that implements TSO, you have certain rules for how to do that, and there are different rules for implementing the memory model correctly when lowering to, for example, PowerPC or ARM. The same is true for Rust.

There is another question in the Q&A.

Right. Okay: how do we decide when and when not to use the smp_* primitives? For example, driver authors typically don't use these unless going across a hardware boundary; is there a definitive way to know when and when not to use them? Yes — the simplest answer is: if you would otherwise have a data race, you probably should use them. This is also what I'll get to later in more detail, in particular how KCSAN can help you here. If you're writing concurrent code, you of course want to understand in which order data is supposed to be modified and published, and if you're not using locking explicitly, in some cases these primitives may be necessary.
If you're writing concurrent code, you want to be aware of the concurrency design of the code, and you need to know in which order, and where, potentially concurrent updates can happen. In all those cases, if there is concurrency between different accesses, you most certainly want to use marked accesses, because otherwise you would have a data race. That is what the definition I showed points out: concurrent accesses conflict if they access the same location and at least one is a write. At that point, if you're not using marked accesses, you will have a data race, and you'll run into all the issues that come with data races. And that's a very clear signal that you should probably rethink the concurrency design if you have data races in critical concurrency algorithms. I hope that answers the question. If there are no more questions, I think we can proceed.

Yes, go ahead.

All right: data race detection in the Linux kernel. First I want to say a word about dynamic analysis, because the techniques we're going to use to detect data races are all dynamic analyses. Dynamic analysis is about detecting certain bug classes or issues in your code dynamically, at runtime. The way this works is that you take a bunch of source files, or even binary files, and modify them in some way by inserting checks, so the final executable contains additional runtime checks. As you execute the code, all these checks are performed, and if the dynamic analysis runtime observes a state change that is invalid or erroneous in some way, it generates a report. There's also a previous webinar from 2021 by my colleague Dmitry, which talks about dynamic analysis in general in more detail.
In the Linux kernel, there have been various past attempts at data race detection, and the most notable one is the Kernel Thread Sanitizer (KTSAN). I want to point this out because if you're familiar with data race detection in user space, you may have used ThreadSanitizer, one of the most widely used data race detectors for user space. Naturally, one of the first things that was tried was to implement the same algorithm in the Linux kernel. KTSAN, like ThreadSanitizer, is compiler-instrumentation based: it uses the compiler's -fsanitize=thread instrumentation to insert its checks. The algorithm is based on detecting happens-before relations between concurrent operations, and this can be quite costly to implement because it requires vector clocks, which are rather space-intensive data structures.

The pros of this approach are that it has fewer false negatives, it's very precise, and it catches memory ordering issues — missing memory barriers — quite well. If instead of smp_store_release() you're using WRITE_ONCE(), for example, this approach would detect the issue. The problem was that KTSAN was not scalable: it had huge memory overheads, in the gigabytes, and, unfortunately, lots of false positives unless all synchronization primitives were annotated. This, I think, was ultimately the straw that broke the camel's back, because false positives are something that really cannot be tolerated in the Linux kernel: wasting developers' time is the last thing we want to do. It also diminishes developers' willingness to investigate reports; every false positive causes trust in the tool to degrade, and that's not what we want.

The way this works is simple: you have normal C code, and then the compiler, with the -fsanitize=thread flag added, instruments it.
Both GCC and Clang support this flag. The compiler adds these checks before every access, and the runtime then does its checking for data races. What the checks do depends on the algorithm; KTSAN's algorithm is very different from KCSAN's, but the idea of inserting checks with the help of the compiler is shared.

There were other approaches that were not based on compiler instrumentation. One was a tool called RaceHound, which is based on the DataCollider approach. The basic idea here is simple: you set a hardware watchpoint on some access, you wait for a bit, and if the watchpoint triggered while you were waiting, then there's a race — the only way the watchpoint could trigger, while you're waiting and not doing anything else in the thread where the access is supposed to happen, is if there was a concurrent access. Also, if while you're waiting and twiddling your thumbs you see that a concurrent access changed the value of the location you're watching, you can likewise infer that there was a race. But these ideas never made it into the mainline Linux kernel, and the question is why.

When we looked at this problem again — this was four years ago now — we wondered: what can we do? What are the requirements for the Linux kernel? One of the most important criteria is runtime performance: we want something performant. All these approaches, including KTSAN, were performant, or essentially could be optimized to the point where they were performant enough. We also need low memory overhead; here, KTSAN unfortunately showed us that optimizing it to a point acceptable even for smaller systems was too difficult. And we have a preference for false negatives over false positives.
As I mentioned, with every false positive sent to a kernel developer, trust in the tool is diminished. This was really one of the biggest problems with KTSAN, because it required keeping up with upstream changes to synchronization primitives to remain free of false positives. Even if only momentarily, between releases, there might be some false positives while they're being fixed in the tool itself — we determined that to be unacceptable. This also leads to maintenance concerns: the tool is supposed to be unintrusive to the rest of the kernel, and with KTSAN this was a big problem. We also want scalable memory access instrumentation; with the help of the compiler this can be made scalable, because the compiler just instruments every access, and that automatically scales to the whole code base. And the analysis should be aware of language-level accesses, that is, compatible with the Linux kernel memory model. The watchpoint-based tools — at least RaceHound and DataCollider as considered for the Linux kernel — didn't get there, or at least it wasn't attempted; with KTSAN, because the compiler is involved and the tool knows the synchronization rules, this is supported by design.

Now I want to show you the tool that satisfies all of these criteria and essentially was a rethink of how to do scalable data race detection in the Linux kernel. So now I'm finally going to introduce KCSAN, the Kernel Concurrency Sanitizer. KCSAN is a compiler-instrumentation-based, dynamic data race detector: it reuses Clang's and GCC's ThreadSanitizer instrumentation, but the runtime is completely different. By default, it detects data races as I've defined them earlier.
With special assertions, which I will also discuss, it can help you find even more bugs beyond data races. It was introduced in Linux kernel 5.8, and since then several improvements have been made. The most notable one I want to point out is that since kernel 5.17, KCSAN can even detect some data races due to missing memory barriers. Because of KCSAN's algorithm, we initially thought we couldn't detect missing memory barriers at all, but KCSAN can in fact detect a subset of them.

Here we're combining the best of the previous approaches I discussed. With the ThreadSanitizer instrumentation, we instrument the entire kernel, and then we take the approach of also using watchpoints: we want to observe exactly when two accesses happen concurrently, and the way to do this is with watchpoints. KCSAN observes that two accesses happen concurrently right when they happen, and it checks all accesses that are instrumented by the compiler.

One big innovation in KCSAN is the notion of soft watchpoints. They're not hardware watchpoints: we wanted to keep KCSAN portable to lots of different architectures, irrespective of hardware breakpoint or watchpoint support. To do this, we implemented a fairly efficient algorithm for soft watchpoints. The idea is: when an access instrumented by the compiler happens, we enter the KCSAN runtime, set a watchpoint, and stall briefly. If the watchpoint fires — or a watchpoint for that location already exists — we know there's a race. We also check the value before the access is performed by the current thread: if the value changed after this brief delay, the stalling that KCSAN does, then we likewise infer a race.
KCSAN stalls accesses with random delays to increase the chance of observing a race. The default parameters are shown here, but you can also change them at runtime or in the kernel config. One other thing KCSAN does to keep the system performant is use a sampling strategy: every access is checked against the set of watchpoints, but only a sample of accesses actually set up a watchpoint. Because setting a watchpoint stalls for a few microseconds, doing this on every access would be wasteful and would also slow down the kernel too much; by default, it does this for roughly every 2000th access, and this can be configured. For example, if you want a more aggressive KCSAN, you can do that with a simple config change, or even at runtime by changing the kernel parameters in sysfs. The caveat is that sampling means a slightly lower probability of detecting very rare races, but we found this is offset by very good stress tests or fuzzers like syzkaller. When we first experimented with KCSAN, we found hundreds of data races immediately, so clearly the approach works very well when paired with a good fuzzer. Now, how to use KCSAN? These days it is supported on various architectures, such as x86 (64-bit only), arm64, s390, MIPS, PowerPC, Xtensa, and perhaps more in future. As for compiler requirements, you need at least Clang 11 or GCC 11. You can build your kernel by simply setting CONFIG_KCSAN=y; I will show you in a second how to get to the same config with the help of menuconfig. And I want to stress that KCSAN is only for debugging and testing kernels; it is not recommended for production kernels because there is a slowdown of more than 5x. This depends on your exact system configuration, how many CPUs you have, and so on, but generally you can expect roughly a 5x slowdown.
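The sampling itself can be sketched as a simple per-thread countdown. The names here are made up for illustration; the real logic lives in kernel/kcsan/core.c and additionally randomizes the interval so watchpoints don't land on a fixed stride of accesses.

```c
/*
 * Simplified sketch of KCSAN's sampling: every instrumented access
 * decrements a counter, and only when it runs out is a watchpoint
 * actually set up (the expensive, stalling path).  All other accesses
 * take the cheap path of merely checking existing watchpoints.
 */
static _Thread_local long kcsan_skip;

int should_watch(long skip_watch)
{
	if (kcsan_skip-- > 0)
		return 0;		/* fast path: check watchpoints only */
	kcsan_skip = skip_watch;	/* slow path: set up a watchpoint */
	return 1;
}
```

Lowering skip_watch makes KCSAN more aggressive (more watchpoints, more stalls); raising it makes the kernel more responsive at the cost of detection probability.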
These days, if you're using KCSAN to debug your own code, I strongly recommend KCSAN's strict mode, which follows the kernel memory model rules strictly. Beware that, as of the latest kernel, enabling this mode may still produce lots of reports for data races that are not yet addressed or fixed. Most of them are considered relatively benign, but fixing them properly takes time, and I expect it might still take a few years for all data races in the kernel to be completely resolved. Enabling KCSAN's strict mode also includes weak memory modeling, that is, detecting a subset of missing memory barriers. For example, if you wrongly used a WRITE_ONCE() where you should have used smp_store_release(), KCSAN can help you detect issues like this as well, since kernel 5.17. To enable KCSAN in menuconfig, you go to the "Kernel hacking" section, then "Generic Kernel Debugging Instruments", and there you can configure KCSAN. Usually the defaults are good enough for most people. One thing you have to enable explicitly, because it is off in the default config, is the strict data-race checking. There are several reasons for this: a lot of code, and subsystems that don't receive as much maintenance as others, still have many data races, and we wanted to make sure that the data races KCSAN points out by default are on the more severe end of the spectrum, so we added some heuristics to filter for those. Nevertheless, if you're writing new code, or you're using KCSAN on your own code specifically, I highly recommend the strict checking mode. When you boot a kernel with KCSAN, you will usually see a message like "kcsan: enabled early", so you can double-check that KCSAN is actually enabled. And if you are running with KCSAN enabled, you will see reports such as this.
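Putting those options together, a config fragment for a debug/test kernel might look like the following. This is a sketch: option availability depends on kernel version and compiler support, and on recent kernels the weak memory modeling option defaults on when it is supported.

```
# For debugging/testing kernels only -- expect a roughly 5x slowdown.
CONFIG_KCSAN=y
# Strict checking of the Linux kernel memory model (recommended for
# new code); disables the noise-reducing heuristics.
CONFIG_KCSAN_STRICT=y
# Weak memory modeling (detects a subset of missing memory barriers),
# available since kernel 5.17.
CONFIG_KCSAN_WEAK_MEMORY=y
```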
A report has a title naming the two functions that raced, and it has access information about the two racing accesses: the operation, at which address, with which size, plus the context and the CPU. Then it shows a call trace, basically the current stack trace leading up to that particular access. If you run this stack trace through addr2line, it should tell you exactly which source line the access is on. Optionally, KCSAN can also show you lock information: if you've compiled the kernel with PROVE_LOCKING and KCSAN's verbose mode is enabled, it will include held-lock info, which can make debugging easier. It shows both accesses, and at the bottom it gives a brief summary of your system. Now, the severity of data races is probably one of the trickiest topics, and really understanding the severity of a data race is something we're still trying to work out; we're even working on techniques to perhaps automatically classify it. Usually, when you find a data race, you will start debugging, and there are several types of concurrency bugs that a data race may point out. The first and worst case is a race-condition bug, where the resulting error manifests as a data race followed by eventual system failure: for example, the kernel panicking as a result of a race-condition bug, where the data race is only a symptom. These are the trickiest to debug and fix, because simply marking the accesses with the primitives I showed you earlier does not fix the problem; the fix usually requires more invasive changes to program logic.
For example, adding missing locking, or perhaps even completely rewriting parts of the logic of the subsystem you're dealing with. In the second case, a data race may point out a potential miscompilation: if the compiler optimizes the code in certain ways, the result is a miscompilation with respect to concurrency, introducing bugs that can lead to system failure. So in this case too, depending on how the compiler performs its optimizations, the fact that there is a data race could lead to the system crashing or the kernel panicking. The fix here is to use appropriately marked accesses, the marked atomic accesses I pointed out earlier, and fixing these types of concurrency bugs usually doesn't require fundamental changes to program logic. The third class of issue is where a data race points out a compiler optimization, a miscompilation in a way the programmer hadn't intended, that only introduces tolerated inaccuracies. In this case we might speak of benign data races: these miscompilations won't lead to catastrophic system failure and won't crash the kernel. Typically this is something like approximate diagnostics, for example statistics counting, where missing one increment of a statistics counter is not catastrophic; it may only result in reading a slightly inaccurate statistic somewhere. In those cases, if you determine that you really can tolerate compiler optimizations that result in these inaccuracies, you can mark the accesses involved with the data_race() macro, which tells KCSAN to never report that data race again.
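That third class can be illustrated with a small compilable sketch. The data_race() stand-in below and the counter name are made up for illustration; in the kernel, data_race() comes from <linux/compiler.h> and suppresses KCSAN reporting for the accesses inside the expression.

```c
/*
 * Userspace stand-in for the kernel's data_race() macro: it simply
 * evaluates the expression, while in a real KCSAN kernel build it
 * additionally tells KCSAN that this racy access is intentional and
 * its inaccuracy is tolerated.  (Uses a GNU C statement expression.)
 */
#define data_race(expr) ({ __typeof__(expr) __v = (expr); __v; })

static unsigned long nr_events;	/* approximate statistics counter */

void count_event(void)
{
	/* Lossy under concurrency -- accepted by design. */
	data_race(nr_events++);
}

unsigned long read_events(void)
{
	return data_race(nr_events);
}
```

Marking the access this way both silences KCSAN and documents for other developers that the imprecision is deliberate.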
This is also helpful to document intent: it signals that this is not a bug, so someone else who finds the same data race won't wonder whether it is one. But it is very important, if you're running KCSAN and finding data races, to not blindly mark accesses, say by sprinkling READ_ONCE() or WRITE_ONCE() everywhere, because that would hide bugs of the first, most severe type, and we want to avoid that. This is also part of the reason why, since the introduction of KCSAN into the Linux kernel, it is taking so long to work through all the remaining data races: there simply aren't enough people, or the people with knowledge of the subsystems where the data races are don't have the time to investigate their severity. And I think at this point we can also take questions. There is a question in the Q&A box, and I posted a question in the chat as well, Marco. Yes. Okay. Kaiwan asks why setting CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y can help. I'm not sure in which context, but generally this probably relates to the noisiness. In a lot of subsystems, plain writes racing with, say, marked reads are not considered data races, and it is simply optional to mark them in some cases. We added this heuristic to KCSAN to consider plain writes that are aligned and up to machine word size as atomic, because with at least most compilers we're aware of, for the relevant architectures, they will simply emit an atomic store; but this is not guaranteed and it may change in future. And generally, that's why I recommend, if you're working on your own concurrent code, using KCSAN's strict mode, which turns off all of these heuristics and rules.
I hope that answers the question. Yes, correct: reasoning about data races, and about which compilers generate the code you want, is a really difficult undertaking. That's why in general we just recommend getting rid of all data races; and if you're particularly worried, use strict mode, get rid of all data races, and then you don't have to reason about how the compiler might compile your code. Okay, next. Right, there's one question: does the 5x slowdown make it harder to debug race conditions and timing-related issues? Actually, KCSAN sometimes makes it easier to hit certain race conditions, because the delays it inserts make different schedules, or perturbations of the schedules between different threads of execution, more likely, at least in our experience. So it introduces some randomness that helps find problems. Okay, that makes sense. The delays can also be tweaked, but the chosen defaults are such that they don't slow down the kernel too much while still introducing some randomness. The delays configured in the kernel config are upper bounds, and KCSAN then chooses a random value below them, so different runs may find different schedules or perturbations. We found that adding these delays, and even the extra instrumentation itself, sometimes makes different thread interleavings more likely; but of course there may also be cases that go in the other direction, so I acknowledge that point. Thank you. There is a question: does this randomness make reproducing bugs much harder? Reproducing concurrency bugs is very difficult even without KCSAN; with KCSAN, I don't think it is any more or less difficult. What we found, at least with syzkaller, is that we still haven't found a good way to generate reproducers for data races.
This is simply because it depends on all kinds of factors: with different threads of execution, you have to hit the same interleaving of all these threads, or at least one of the set of interleavings that makes a given data race possible, and reproducing that is very difficult. It is an ongoing area of research; at the moment we're not actively investigating reproducing concurrency bugs or data races, but it is a very interesting area and something we hope to address in future. For example, if syzbot finds data races, having reproducers would be very helpful, but at the moment it can't generate them reliably. Thank you. Looks like that's it for the questions for now, Marco. Okay, then I will proceed. So far we've talked about data races, how to detect them, and their severity, and about when a data race might tell you about a concurrency issue in your code. But there are of course also concurrency bugs beyond data races, bugs that would not manifest as data races, and KCSAN can help you find some of these issues too. Let me introduce an example. Say we have one thread that takes a spinlock, update_foo_lock, and there is a comment: "Careful, there should be no other writers to shared_foo; readers are okay." So we have a shared variable, shared_foo, which may only be updated by a single writer, but may have multiple concurrent readers even while it is being updated. Then we have another thread that does exactly that: it won't take update_foo_lock, but it reads the shared variable with READ_ONCE(). This is not a data race, and it is allowed according to the comment. And then we have a third thread, which does exactly what we didn't want.
There's a bug: a potentially concurrent write to the shared variable, because the third thread is not taking the lock. With the help of KCSAN, you can add an assertion called ASSERT_EXCLUSIVE_WRITER(). KCSAN has a family of ASSERT_EXCLUSIVE macros which let you add annotations to your code to convey the intent of your concurrency design to KCSAN, and then help you find bugs such as a concurrent writer to a shared variable that doesn't take the lock. These macros are all prefixed with ASSERT_EXCLUSIVE, and they help you specify properties of concurrent code where bugs would not normally manifest as data races. I want to stress: where a bug would likely manifest as a data race anyway, because the accesses are unmarked, I suggest not unnecessarily sprinkling these assertions into your code; but if your code has, for example, marked accesses where a concurrent update might be a bug, these assertions can help. In the kernel log, bugs of this type, if KCSAN detects them, are prefixed with "KCSAN: assert: race in" followed by the functions and the usual report. There are three main assertions (or rather five; I'll get to the difference in a second). We have ASSERT_EXCLUSIVE_WRITER(var), which takes a variable: essentially you pass it the same expression you would pass directly to READ_ONCE(), so if you have a pointer, you dereference it, i.e. an expression that denotes the object itself. ASSERT_EXCLUSIVE_WRITER() asserts that there are no concurrent writers, while concurrent readers are still allowed.
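The intended design from the example can be sketched in compilable form. These are userspace stand-ins for the kernel primitives (the names update_foo_lock and shared_foo come from the example; in a real kernel build, ASSERT_EXCLUSIVE_WRITER() comes from <linux/kcsan-checks.h> and would report any writer racing with the assertion, such as the buggy third thread).

```c
#include <pthread.h>

/* Userspace stand-ins for the kernel primitives used in the example. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))
#define ASSERT_EXCLUSIVE_WRITER(var)	((void)0)  /* checked by KCSAN only */

static pthread_mutex_t update_foo_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_foo;

/* The single writer: takes the lock; KCSAN would flag a racing writer. */
void update_foo(int v)
{
	pthread_mutex_lock(&update_foo_lock);
	ASSERT_EXCLUSIVE_WRITER(shared_foo);
	WRITE_ONCE(shared_foo, v);
	pthread_mutex_unlock(&update_foo_lock);
}

/* Lockless readers are allowed by the design; the access is marked. */
int read_foo(void)
{
	return READ_ONCE(shared_foo);
}
```

Note that without the assertion, a lockless WRITE_ONCE() from a third thread would be a marked access and thus not a data race at all; the assertion is what lets KCSAN catch the design violation.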
There is a second variant of the same assertion, ASSERT_EXCLUSIVE_WRITER_SCOPED(), which asserts the property for an entire scope: placed at the start of a function, it asserts that there are no concurrent writers for the whole function. Then there is a stricter assertion, ASSERT_EXCLUSIVE_ACCESS(), which asserts that the current thread has exclusive access to a particular variable: no concurrent writers and no concurrent readers. And there's a further variant, ASSERT_EXCLUSIVE_BITS(), which asserts that there are no concurrent writers to a subset of the bits of a variable. This was introduced because of a case in the memory management subsystem where we wanted to tell KCSAN that concurrently updating one particular bit of a variable was a bug, but we didn't care about the other bits. If you're interested, the documentation has a lot more detail, or I can answer questions later. To summarize: concurrency in the Linux kernel is quite challenging, and we need good tooling; KCSAN is one of these tools. There are other tools in the kernel that can help as well, among them lockdep, for example, but the types of bugs those tools detect are different from KCSAN's. KCSAN really helps detect concurrency bugs at the lowest level: it tells you about data races so that you can avoid introducing concurrency bugs as early as possible. My suggestion is to avoid data races altogether if you can, unless you have, for example, the approximate diagnostics I mentioned earlier; and for all new concurrent code in the Linux kernel, I strongly recommend using KCSAN in its strict mode to avoid data races.
There's a lot more documentation on KCSAN, and also a two-part LWN article, "Concurrency bugs should fear the big bad data-race detector", which goes into some of the philosophy around why certain subsystems may choose slightly different styles and preferences for how to mark accesses. And with this, I can take more questions. Any other questions for Marco? So what you're saying is: when you write new code, or if you have an existing code base that you have never run KCSAN on, run it in strict mode to find the problems, and tweak the randomness if you can to find more. At least do that on your code base. When somebody writes a new driver, for example, we say: enable lock debugging and all the spinlock debug options to find all the places that could be problems. So in the same way, KCSAN can be used to detect the data races and similar concurrency problems you might have in new code and also in existing code. Yes, I think the main point is that I want to encourage everyone to write code that is free from data races, and to use KCSAN's strict mode, the stricter treatment of the kernel memory model rules. I had a slide somewhere on this; I may have skipped it. On testing best practices: you mentioned drivers, for example, and I think it can often be quite tricky to write or execute tests that hit the corner cases and the rare data races. A good strategy, for new code or for existing code people want to improve, is to write rigorous concurrency tests. We want to design test cases that cover both expected and unexpected interleavings, and this can be tricky: you really have to think about how different threads of execution may interact.
Then, of course, ensure all the tricky corner cases and real-world cases are considered, and stress the code with a high number of threads to simulate worst-case scenarios. For drivers, I think it's an interesting exercise to design tests that stress the code in ways that simulate the worst-case concurrency. It's also important to write tests that can be executed quickly and repeatedly, so that you can run a particular case in a loop for thousands of iterations; this improves the probability that KCSAN can detect issues. And then also fuzzing, of course; but with fuzzing it's not guaranteed that fuzzers will find the tricky corner cases where a weird data race or race condition results in a crash, because you have to generate inputs such that the kernel executes different threads and produces interleavings that reach a bad state. So with KCSAN, my recommendation is: if you're writing new code, write rigorous stress tests that execute lots of different threads, keep the concurrency of the code under test in mind, and enable KCSAN; hopefully that will weed out a lot of the concurrency issues. Right, that brings to mind the media subsystem: I had to write a test script to make sure the driver structure is not being touched after a use-after-free-type error, kicking off five or six instances of the script to stress-test the driver device file. So yes, that's kind of what you're saying.
You have to think about concurrency, you have to think about where the vulnerable points in your code are and go at them by writing tests, and then also use KCSAN and other tools to make sure your code is solid. Yes. And KCSAN has various knobs; the kernel documentation for KCSAN also covers the knobs available to tweak performance. I can actually click on this, and I hope you can still see this window; here is the section on tuning performance, which is the one I wanted to mention. By default, KCSAN's parameters are very good: I would just run with the defaults and try to find as many bugs, or data races, as possible. Once that is exhausted, more advanced users of KCSAN might consider tweaking the core parameters. In particular, the one I would tweak is the skip_watch parameter, the number of per-CPU memory operations KCSAN skips before it sets up another watchpoint; this is the sampling strategy KCSAN uses. You can change this on a running kernel: in a terminal, you can just write a new integer into /sys/module/kcsan/parameters/skip_watch. It's instructive to play with this: you can see how the kernel becomes slightly less responsive if the value is lowered, or slightly more responsive if you increase it. One effective strategy, which I have used in the past, is to write a shell script that randomly changes this value in the background.
The skip_watch interval is already randomized, but to keep KCSAN's performance predictable we've chosen not to change the value too drastically at random, and rather leave it to the user, who can set this parameter from user space and change it on the fly as needed, perhaps improving KCSAN's effectiveness that way. But by default, I would not recommend tweaking these parameters at first. Thank you. Any other questions for Marco? If you're watching the chat, thank you for following along, Marco. You're welcome; I hope it was useful, and please let me know if there are any more questions. As I talked earlier about the ASSERT_EXCLUSIVE macros: those fall into the more advanced uses of KCSAN, but I would really like to see more uses of them in the kernel, because I think they can help find more concurrency bugs. Here, again, is the same example I showed earlier. I used these macros in some new code I wrote, and they also help document intent. For example, if it is not easy to express that a certain variable shouldn't be accessed concurrently, say because it is a marked access, the macro documents this intent in code rather than in a comment, like the "Careful, there should be no other writers" comment in this example. Thank you very much. This has been a very useful session. It'll help a lot of people, especially new developers coming in wondering what to do and how to figure things out, and even people who have been doing this work for a long time and want to find out what kind of concurrency problems are lurking in their code. So this is very helpful. Great, yeah. Thank you. Thanks for having me.
And I think if there are no further questions, it's time to hand it back to you. Thank you, Marco. Perfect. Thank you, Marco and Shuah, for your time today, and thank you everyone for joining us. As a reminder, this recording will be available on the Linux Foundation's YouTube page later today, and a copy of the presentation slides will be added to the Linux Foundation website. We hope you are able to join us for future webinars and future mentorship sessions. Have a wonderful day. Thank you.