Test, one, two. Test, one, two; let's see if this seems reasonable, right? So you can use this microphone and we should be good. "It was many and many a year ago, in a kingdom by the sea, that a maiden there lived whom you may know by the name of Annabel Lee; and we loved with a love that was more than love, I and my Annabel Lee." "Once upon a midnight dreary, while I pondered, weak and weary, over many a quaint and curious volume of forgotten lore; while I pondered, nearly napping, suddenly there came a tapping, as of someone gently rapping." It's 11:15; can everybody hear me now? Excellent. So I'll start off by saying the one thing I must not forget to say, which is: hi, Dad.

Okay. Thank you to the organizers for accepting this presentation, which is about deferred work in the Linux kernel, and thank you everyone for attending this morning when there are so many other talks you could have heard. Before I get into the material, I'll say that all the demo code, the Linux kernels, and other such artifacts are at this GitHub URL. In fact, you can download the demo code and run it yourself on your machine if you have properly prepared it. The latest version of these slides will always be at this URL; it really is HTTP rather than HTTPS.

So, to dive right in, let's talk about what deferred work is. (Is it too loud? Too quiet? A little bit louder? Okay, how's that? Okay. I'm not a good singer, so it'll be like this; I'll get it right before we're done.) Okay, so to get into the technical material now, let's talk about what deferred work is, and thank you, audience, for your participation.

Deferred work comes in basically two categories. There is work that was running, or could run, but doesn't because of resource unavailability; the resource that is lacking could be a lock, a buffer, a message, and so that work is waiting. Then there is work that is newly needed in response to an event, and this work is typically run by a function called a callback. Just to make more concrete what I'm talking about, here is a seasonally appropriate example. Your accountant might say, "upload your tax documents." But rather than do that now, if you're in an uninterruptible fun state, you might enqueue the work function on a list to be performed later. So this is an example of deferred work, and of course deferred work is not unique to, or invented by, the Linux kernel. It's an old pattern in software engineering, used by all kinds of programs.

Well, let's turn to the kinds of deferred work that are important for the Linux kernel. For the purposes of this talk, there are two major performers of deferred work. There are softirqs, so called because they follow on from, and are the "bottom halves" of, in the parlance, hardirqs, which are actual hardware interrupts. And then there are kworkers, which are executors of work that has been placed on workqueues. The first half of the talk will describe softirqs, which are a legacy mechanism added to the kernel in 1992; the file softirq.c actually has Linus's own copyright at the top. Then we'll turn to kworkers, which are also a very old feature of the kernel, but are newer and more flexible.

And so why would you care about deferred work at all? Why would you ever think about it?
Well, the answer is that sometimes it causes problems on the system, and so if you're a kernel caretaker or a kernel developer, you may have to look into it.

One example of something that can go wrong is that a task gets deferred too long. While the work that is performed by softirqs and workqueues can be put off, and shouldn't be performed in interrupt context, for example, it can't be put off forever. Take RCU, which stands for read-copy-update: it's essentially the kernel's garbage collector that frees no-longer-used memory. You can guess that while you can delay freeing unused memory, you can't delay it forever, or you actually run out of memory. Strangely, another sign that work is being delayed too long is when heavy network traffic makes the system kind of fall over or freeze up. That may be because the network softirqs, for example, are delaying other softirqs; much more about that in a moment.

The kind of opposite problem to too-long deferral of work is that the executors of deferred work, like kworkers or ksoftirqd, can run and run and run, and meanwhile user space, and serving user-space applications is the purpose of the system, doesn't get to run because these callbacks are running. So you have to look into that. Oddly, another sign that deferred work is running too long is that the watchdog timer associated with kworkers can fire, which happens when a kworker is sitting on a core and not yielding. If you work on systems that are very busy, you have to look into this kind of problem.

The real pain point about softirqs and workqueues is that they gather up a lot of different kinds of tasks together, and it has often been hard to figure out exactly what they're doing, and hard to figure out what to do about them when you have problems. So the purpose of this talk is really to open this window and give you some tools to investigate these work-deferral mechanisms, and to some extent do something about them. The first half of the talk is about softirqs, which remain difficult to control and difficult to understand, and the second half of the talk is about workqueues, which is full of good news; so I'll give away the fact that the talk has a happy ending.

But with that, let's dive right into softirqs. So what are the softirqs, really? Well, there are ten of them, ten delicious flavors, listed here in the order in which ksoftirqd will execute them. There are the HI and tasklet softirqs, which are first in order. Then there are the network softirqs, which are kind of obvious. There are some associated with the block layer; for those who don't know, the block layer is what connects memory and file systems to storage devices. There are some run by the scheduler. There are the rather obscure IRQ_POLL ones, which as far as I know are used only by network-attached storage devices. There are two kinds of timer softirqs, and there are some associated with the read-copy-update garbage collector. Without knowing anything further about softirqs, you might say "aha, the network receive softirqs are above the timers in this list" and feel some foreboding.

And so the fundamental design, that we have a hardirq and then we have a softirq that does the work the hardirq indicates needs to be done, sounds like a good design. But there are problems. In fact, softirqs, just between us, are not the most beloved feature of the kernel.
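Incidentally, if you want to see these ten flavors, and how often each one has fired on each core of your own machine, the kernel already exports counters. This is just a minimal sketch; the column layout depends on how many CPUs you have:

    # Per-CPU counts of every softirq flavor (HI, TIMER, NET_TX, NET_RX,
    # BLOCK, IRQ_POLL, TASKLET, SCHED, HRTIMER, RCU) since boot.
    cat /proc/softirqs

    # The ksoftirqd threads themselves, one per core.
    ps -e -o pid,psr,comm | grep ksoftirqd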
Here are three money quotes from well-known kernel developers. Frederic Weisbecker has said that softirqs are a pain to deal with. Thomas Gleixner, who is the dean of real-time Linux, has called ksoftirqd "the big softirq lock," and he has commented that softirqs have heuristics, which he finds disgusting.

So brace yourself: I will now show you these disgusting softirq heuristics. There are two limits which, according to the documentation, were established by experimentation. They are shown right here: one is the maximum time slice that ksoftirqd can use, and one is the maximum number of times that ksoftirqd will run through that list of ten softirqs. These numbers are hard-coded into the kernel, and since you all undoubtedly know that the kernel runs everything from smart watches to sprinklers to supercomputers to Android phones, you might be wondering: huh, why are these numbers perfect for all those use cases? The answer is that they're not, and no one knows why they are what they are, but no one has the nerve to change them. They've sat in the kernel for a long time. This situation is particularly painful because we continually have improvements to scheduling in the kernel; there's going to be a talk later this afternoon, in this very room, about some new scheduling improvements, which I'm looking forward to. But the scheduler can't vary these numbers, because they are compiled in. So this is not a great situation.

But furthermore, that's not the really bad problem with softirqs. Let's talk about what the really bad problem is. Only one softirq runs at any given time on a given core. What that really means is this: when a hardirq occurs and, as we say in the kernel, "raises" the associated softirq, the runtime will check whether another softirq is already running, in other words, whether this hardirq has preempted a running softirq. If no softirq is already running, then the happy path, call it the piggyback, occurs, and the function do_softirq() will invoke the softirq that the hardirq indicates is needed. However, if a softirq was already running, then it has taken a lock by calling local_bh_disable(), which is the villain in this part of the talk ("bh" because, you recall, softirqs are also called bottom halves), and the newly raised softirq will have to wait for the ksoftirqd thread to execute it. So in other words, the softirq may actually run in interrupt context, right after the hardirq, or it may run some time later. The reason I say "some time" is that the next time ksoftirqd runs, it could run out its time slice and never get to this softirq, so the waiting may go on for a while. That is the big softirq lock that Gleixner is referring to.

I want to talk briefly about how you can tell what ksoftirqd is doing. To do that I have two pieces of demo hardware here. One is this very laptop, on which I'm running a kernel that no sane person would run on a computer used for a presentation; the laptop is running kernel 6.7. The other is an arm64, 8-core i.MX8 board running kernel 6.1. What I'm going to show you (I have to put down this mic for a second; hopefully it won't make a loud crashing noise) is a tool called stackcount, which is part of the libbpf/BCC tools suite, a package you can install on your system. What stackcount does is print out stack traces from the running kernel that end in a particular function.
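As a rough sketch, the invocation looks something like this, assuming the BCC tools are installed under their usual paths; the exact softirq entry symbol (do_softirq versus __do_softirq) varies by kernel version:

    # Count kernel stack traces that end in the softirq dispatch function.
    # Let it run for a while, then hit Ctrl-C to print the per-stack counts.
    sudo /usr/share/bcc/tools/stackcount __do_softirq

    # On Debian/Ubuntu the BCC tools are usually packaged with a -bpfcc suffix:
    #   sudo stackcount-bpfcc __do_softirq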
I showed you in the previous slide that the do_softirq() function is actually the executor of softirqs, and so that's the function we're going to trace. The demo is going to be run by stackcount via a shell script; I'm running all these demos via shell scripts just to minimize the amount that you have to watch me type, but the demo is live. What is going to happen is that, as with all BPF programs, LLVM is going to JIT-compile some C code and submit it to a verifier in the kernel, which, if the program is found to be valid, will load it into a virtual machine inside the kernel. Then it will spit out these stack traces that end in the function do_softirq() and show you what the paths of execution leading to calls of do_softirq() are.

If this all sounds confusing, here are some warning messages; always nice in the morning. Now I'll run it on the 6.1 kernel on the Arm board, and LLVM is actually JIT-compiling this code and inserting it into the kernel. You can see the laptop is faster than the Arm board. And so here are our stack traces. There were 6,200 occurrences of this stack trace during that time period, and this stack trace refers to cpu_idle, so I believe that it is a scheduler softirq. If I can do this one-handed: here's one that says sendmsg, so this is clearly a network stack trace, 914 of those. And here's the result from the Arm board: here's another cpu_idle, so this is another scheduler softirq. All these functions with EL1 and EL0 in their names are Arm assembly. (It would be great if I had a hands-free mic, but whatever.) So here's more TCP stuff. Here's a PCIe function that is calling do_softirq(). You can see local_bh_enable() here, freeing the lock right before do_softirq() runs, and so forth. This is the best way I know to see what ksoftirqd is actually doing on a running system. So, enough about that; this is the demo you've just seen.

Let's talk a little bit more about the big softirq lock and why it is a problem for real-time Linux. This slide shows why local_bh_disable(), the big softirq lock, is a problem. Another title for this slide could be "ftrace output is hard to understand," but let's work with it. This slide actually comes from Sebastian Siewior, who is one of the main developers of RT Linux, from a talk he gave at Linux Plumbers in the fall. Fundamentally, what we see here is that a network softirq (NET_RX) is running. A SATA interrupt, which is higher priority, comes in and would like to run the block softirq, but it can't run, even though this is an RT system and the highest-priority thread should run, because the network softirq has called local_bh_disable(). So what actually happens is that the scheduler increases the priority of the network receive softirq, it finally finishes (which may take a while), and then finally the block softirq gets to run. So this problem with local_bh_disable() and the non-concurrency of softirqs is something that we in this room, in theory at least, can improve.

So I wanted to show you a timeline of the great labors over the years to improve softirqs. Starting with kernel 3.8, you could run the RCU garbage-collector callbacks in their own kthreads and not inside ksoftirqd.
That means you could pin those threads to cores and you could increase their priority. Obviously, changing the priority of ksoftirqd itself doesn't do you a lot of good, because you're changing the priority of all those softirqs together. Starting in kernel 5.12, you could run the network callbacks in their own thread and move them out of ksoftirqd.

There have been two attempts to move timers out of ksoftirqd; if there's any callback you would like to be timely, presumably it would be a timer callback. Both are fairly recent. One was a proposal by Frederic Weisbecker in kernel 6.5 to mark individual timers as able to be run at the same time as other softirqs; that ran into lock-depth problems and didn't get very far. Another one, called nested local locks, was presented in the Siewior talk from Plumbers that I just showed a snippet of, and it is very clever. It actually uses scoped locks; the kernel has scoped locks now, so we could in theory remove all the gotos from the kernel, which would just be amazing. This was an attempt to basically get rid of local_bh_disable() and replace it with actual locks placed close to the data that local_bh_disable() is protecting. In good software engineering practice, we want to have locks right next to the critical data that can't be touched by multiple threads at one time. The problem is that local_bh_disable() is scattered throughout the kernel.

And then finally, just merged in 6.9 this week, there is some work by Tejun Heo, the maintainer of workqueues and cgroups in the kernel, to change tasklets into users of the workqueue API. So the tasklets are going to stay run by ksoftirqd, but their API is completely changing. Tasklets had a global use-after-free problem, which, if you don't know what that is, is a big headache for security. So there's work underway to basically change all the tasklets over from the existing API to workqueues, and this change was Linus's own idea, so it's not surprising that it has merged. I should say that Tejun Heo, the author and maintainer of workqueues and cgroups, will be speaking in this very room later this afternoon, and I'm looking forward to hearing that.

So with that, we are... good God, that was not what I wanted to happen. I think I clicked on a hyperlink; a lot of the text on these slides is hyperlinks. There we go.
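Concretely, the first two entries on that timeline are things you can already turn on today. This is a minimal sketch; the CPU list and the interface name (eth0) are just examples:

    # Offload RCU callbacks from the softirq path into rcuo kthreads on CPUs 0-3.
    # This is a boot-time kernel parameter (requires CONFIG_RCU_NOCB_CPU):
    #   rcu_nocbs=0-3

    # Run NAPI network-receive processing in a per-device kthread instead of
    # NET_RX softirq context (kernel 5.12+); eth0 is just an example interface.
    echo 1 | sudo tee /sys/class/net/eth0/threaded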
So, to summarize this part about softirqs: there are about 250 call sites of local_bh_disable() in the kernel, so getting rid of it is no small task. As people have commented, nobody really knows what this lock is protecting; that was the same situation with the Big Kernel Lock of old, which took years to remove. It is possible to run the RCU, network, and (in the real-time kernel) timer softirqs in separate threads, which makes them more manageable and easier to observe. But if you do that, you're incurring a context-switch penalty every time that thread runs, and that may actually make your system's performance worse. So for all these tunables that I'll be mentioning, as with any kernel configuration, you really have to run performance tests relevant to your workload to see if the changes will help. And as I said, it's not so user-visible, but a big improvement is coming to tasklets in kernel 6.9, whose merge window is still open, because they will be changing over to the workqueue API, and workqueues are just an all-around better design. So we're done with softirqs, and if anybody has any questions about softirqs before we move on, I'll be happy to answer them. (There's no room to put the mic stand next to all the doodads I have on top of the podium.)

For your demo, what were you actually running in the script? You were tracing syscalls, but what were you actually running in order to trace? Let's go back and we can look at that quickly. As I said, all of these are on GitHub. Here's just a slide that substitutes for the little demo: it's actually running this upstream tool, stackcount, which prints out these stacks along with the number of occurrences of each exact stack trace. And you can put almost any function you want here; there are very few functions you can't put there. So you can trace any function you want using this upstream tool.

Just to confirm, the bottom-half lock is per core? That is correct. So two instances of a softirq, say two block softirqs, could be running on two different cores, but on a given core there is per-core data that can't be corrupted by two threads. And the crazy part is: is it really not possible to run block-device softirqs and, you know, scheduler softirqs at the same time? What per-CPU data could they have in common? And the answer, as I said, with 250 call sites of that function, is that no one actually knows.

Do you gain anything by trying to lock your softirqs to a particular core? So, softirqs are per-core threads: at boot the kernel creates a ksoftirqd thread for each core, and that's fundamental to the kernel's design. A hardirq on a given core will raise the softirq on that core. If you have real-time Linux, you can pin the interrupts to different cores, but if I start talking about that, I'll never talk about workqueues. Any other burning questions before we talk about workqueues?

Let's get to the good-news part of the talk. The hardest part about this presentation has been that workqueues have changed so much in the last 18 months that I've had trouble keeping up with them while preparing this talk. Let's talk about what workqueues are.
So here's a diagram to explain the fundamental idea of workqueues. Unsurprisingly, we have lists of work functions, which are pointers to the work that needs to be performed. Each one of these workqueues belongs to a different kernel entity: it might be that some file system, for example, owns this workqueue, and that file system driver is free to enqueue all sorts of work functions on its workqueue, while a different driver, say a crypto driver, would have its own workqueue with its own work functions. Nonetheless, these two workqueues may be associated with the same worker pool, which includes the kworker threads that actually perform the work on behalf of the drivers. And once again, the bound workqueues that I'm talking about here are backed by kernel per-CPU threads. At boot, on each core, the kernel starts a high-priority worker pool and a default-priority worker pool, and the kworkers have names like kworker/<core number>:<instance>. When you create a workqueue, the workqueue is associated with a worker pool based on the flags that you set on it, and so if you set high priority on your workqueue, it will be assigned to the high-priority worker pools; nothing surprising about that.

Let's talk about the names of kworkers you see in the process table. I've sort of described the bound kworkers already: there's a fixed number of pools, two per core; the kernel will spin up and spin down new kworkers as they're needed; and these kworkers are bound, so they can't migrate. In contrast, we also have the mechanism of unbound kworkers. Unbound kworkers can run on any core by default, though that is configurable, as I will note in a second. There is a pool ID, which is just a number, as is the kworker ID, so every time you reboot, these numbers may be different. The unbound pools are for long-running, persistent work; some are created at boot for system services and some are dynamic, related to device drivers. These work items can wander from core to core. Not only is there not a fixed number of workers per pool, there's not a fixed number of pools.

And so, once again, why might you care about all this stuff? Who cares about these workqueues and kworkers? The answer is that if you are responsible for maintaining the kernel, you may be presented with a Jira ticket containing a splat like this, where the workqueue core informs us that some kworker has been running and not yielding for 207 seconds, which is a lot of computer time, and that the anti-social kworker is associated with pool 112. The splat will show you (this is real data from a 5.15 kernel) that there were three workqueues associated with that pool at the time the crime was committed: one was the ixgbe network driver, one was some file-system cleanup tasks, and one was a device driver whose name has been redacted. You can see that this is in fact an unbound workqueue, because it could run on any of 56 cores, and the reason these three workqueues are in the same pool is that the flags of each of them end in the byte 0xA, flags that describe the configuration of the workqueue.

Now, if you have experience in maintaining the kernel and dealing with conflicting threads, threads that are hogging cores, you might say: huh, I'll either change the priority of the thing that's causing the problem, or I'll pin it to a core away from the work that I really care about. So you might use chrt to change the scheduler policy, you could use renice to change the priority, or you could use taskset to set the CPU affinity of the problematic thread. And all three of these methods won't work.
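As a quick illustration of what "won't work" looks like in practice, here is a minimal sketch with a hypothetical PID; kworkers carry the PF_NO_SETAFFINITY flag, so the scheduler syscalls simply refuse:

    # Pick some kworker thread (the PID is whatever you find on your machine).
    ps -e -o pid,comm | grep kworker | head -3

    # Trying to pin it to CPU 2 fails, because the kernel marks kworkers
    # PF_NO_SETAFFINITY; taskset typically reports something like
    # "failed to set pid ...'s affinity: Invalid argument".
    sudo taskset -cp 2 <kworker-pid>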
So I'm going to show you this, once again, by coming over here to the hardware. The thing you'd like to do is use taskset to set the CPU affinity of a thread. I've written a program called classify-process-affinity (you do have to type the names of the programs; that's the only typing), and we'll ask classify-process-affinity to read the affinity of the kworkers. So on this 8-core Arm workhorse board, here are the kernel per-CPU bound kworkers at the top, with the names kworker/<core number>:<n>. Down here at the bottom we have the unbound kworkers; they're in a default-priority and a high-priority pool. You can see right away that kworkers come and go, because the numbering of these kworkers in pool 8 is not consecutive: presumably there used to be a kworker/u8:2 that wasn't being used, so the kernel got rid of it. But the real point of this demo is that we can see that all these kworkers are unpinnable, and that seems really odd, because I just told you that the unbound kworkers can migrate. If they can migrate, why shouldn't you be able to pin them? That seems to make no sense, and the answer is going to be shown in a second. I just want to compare and contrast the situation with kworkers on the newer kernel, 6.7: you can see there are now kworkers named kworker/R-*, which are rescuer kworkers; they keep memory-reclaim work going when the system is getting low on memory, so they help prevent the system from OOMing.

Let's go back to the slides. The way classify-process-affinity works is that it reads task_struct flags from procfs. What do I mean when I say that taskset and chrt and renice manage the wrong thing? What I specifically mean is that what you really care about is your work, not kworkers. You don't care about the kernel threads that perform the work you need; what you really want is for the work to get done. And so the workqueue API requires a change of mindset: you actually configure properties onto the workqueue, onto the work, and then the workqueue core provides kworkers that match the attributes you want. taskset and chrt make system calls into the kernel scheduler, and the kernel scheduler doesn't manage workqueues; it manages threads. You need to use the workqueue API, which manages work.

I'm running out of time, apparently, but I'm going to show you, on x86 and kernel 6.7, a demo which should make this clear. Let me bring that window back up... okay, all right, perfect. Now we're going to jet through this demo. This demo is to show you how to manage workqueues via the workqueue API, not via the scheduler, and how you bind properties to the workqueue, not to the kworkers, about which you don't care. This demo won't work before kernel 6.6 (actually, let's start this over again), which is why we're looking at it on the Intel machine. Here we're looking at a listing of /sys/devices/virtual/workqueue. There are five workqueues in there: one associated with block cgroups, three associated with NVMe, and one associated with flushing dirty pages back to the backing storage. You can kind of guess, from seeing that the workqueues exported to sysfs are all storage-related, that unlike with softirqs, where it is the network stack that tends to cause problems for other work, with workqueues it is the block layer. And here are the tunables that are settable via sysfs.
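Here's roughly what that looks like from a shell; this is a sketch from a recent kernel, and which workqueues show up depends entirely on which drivers opted in, so your listing will differ:

    # Workqueues that opted into sysfs export; on the demo machine these were
    # a block-cgroup one, three NVMe ones, and writeback.
    ls /sys/devices/virtual/workqueue/

    # Attributes of one unbound workqueue (affinity_scope/affinity_strict need 6.6+).
    cd /sys/devices/virtual/workqueue/writeback
    grep . nice cpumask max_active affinity_scope affinity_strict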
There are two associated with the affinity of the workqueue; notice it's the affinity of the work and not of the kworker, again. There is one associated with concurrency. There's niceness, and you know what that is. I don't have time to talk about it much, but the treatment of affinity in the kernel for workqueues has really improved in recent kernels: the default affinity scope is per cache domain now, where it used to be per NUMA node, so this is a big improvement in performance by default. We can see that the default niceness is zero; I guess that might be off the bottom of the screen, oh well, you'll have to take my word for it. Now we're using the drgn debugger from Meta to run a script called wq_dump.py, and we're asking it about the NVMe delete workqueue. We can see that the NVMe delete workqueue is running in worker pool 16, on all cores, because it is unbound. This demo is a little bit slow because it really is running in real time. And so now we're asking drgn to query the kernel: what else is running in worker pool 16 besides the NVMe delete workqueue? The answer is a whole bunch of stuff, things related to Thunderbolt, related to graphics, and so forth and so on. Then we echo -4 into sysfs to change the niceness, and we can see that the NVMe delete workqueue has moved to a different worker pool. So, as I said, I'm just going to skip the rest of this and go back to the slides so I can finish.

What I've just shown you is that you can configure workqueues via sysfs, and not via the functions that you're used to, and workqueues have had just one improvement after another in 6.5 and 6.6. There's also a brand-new BPF/BCC tool for workqueue latency, which is not even part of the kernel source repository; it has just appeared. And so with that: I have told you that softirqs run in interrupt context; they can be fast under the best circumstances, but often they are slow, the opportunities for configuring them are few, people have been working on them very hard, and the lack of progress is frustrating. Workqueues run in process context; you manage the workqueues, not the kworkers, about which you don't care; and their observability, configurability, and performance have improved greatly in recent kernels. I thank Sarah Newman for her suggestions about this talk, and there is a huge amount of further advice and resource information in these slides, which are available at the SCALE website.

So with that, I thank you for your attention, and I think I have time for about one question, but I'll be around after the presentation if somebody wants to ask. Wait, I've got 20 minutes? Oh man. Who wants to see about a half dozen more demos? Or, actually, who has a question first? Well, I'll just show you this. When I changed the niceness of this workqueue, it changed to pool 18. So we were able to change the pool of the workqueue by setting the niceness, and you might say, who cares? Well, if we ask drgn now to query the kernel, we can see that pool 18 in fact has niceness -4. So the reason this NVMe delete workqueue is in pool 18 now is, once again, that its attributes match that pool. And what else runs in the new pool 18? Pool 16 had something like 15 workqueues associated with it; the answer for pool 18 is nothing, so the NVMe delete workqueue has its own worker pool. So that's great, right? That's wonderful. But actually it might not be great, and I'll answer your question in just a second.
The reason it might not be great is that switching between workqueues, when a kworker is running, doesn't require a context switch, and context switches are very expensive in modern kernels. So it could be that, by putting your workqueue in its own pool all by itself, it got more latent. There's always a trade-off between performance, CPU utilization, and latency; there's no perfect solution; it depends on your workqueue; and you have to measure to find out if you're making something better. So let's run the microphone over to the person who has a question. And Tom, I think you left your sunglasses up here.

Could you elaborate on what niceness means? How exactly does niceness change where the work is being moved? Right, so the question is how niceness affects where the work is being moved. There was no worker pool with niceness -4, or 11, or something like that, right? So the kernel will just create new worker pools if there is no pool that matches a workqueue's attributes. Pool 18 may already have been there and just been idle, for all I know; I'm not actually sure. The niceness itself then does just what you think it does: it increases or decreases the priority of the kworkers. So it's a communication between the workqueue API and the scheduler, but it's very odd, because you're setting this niceness on a data structure, essentially, and niceness is the priority of a thread. So the workqueue API is really weird; but the more you think about it, the more it actually gives you the tools to do what you want, because no one really cares about kthreads at the end of the day. I'm not sure if I answered your question.

Okay, so a threaded IRQ can have its affinity set to a particular CPU. It seems like now you can also set a particular workqueue to have affinity to a particular core; is that correct? It's one of those parameters, right? So the question is whether you can set the affinity of a workqueue to a particular core. That is correct: if you set the niceness to zero, which is the default, or to -20, which is the high-priority one, and you set it to a particular core, it's going to run on one of the bound kworkers. But will it stay on the same core all the time? It will, oh, it will, if you set the affinity. Let's go back and look at that. (It's too small, you can't see it? That's right, I'll make it larger, thank you.) So if we look at /sys/devices/virtual/workqueue/writeback (yeah, we'll get it bigger), here we have two parameters, affinity_scope and affinity_strict. affinity_scope can be per CPU, per cache domain, per NUMA node, or system-wide, or there's one called SMT, which has to do with hyperthreads on a given core. Cache is now the default, which is great: unbound work will migrate only within the scope of a given cache domain by default. Then there's affinity_strict, which is a parameter where you can express: if my work can't run because its current affinity scope is too busy, do I want to allow it to migrate to a different affinity scope? You also can set the CPU mask.
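A quick sketch of what playing with those two knobs looks like; writeback is just the example workqueue from the demo, and the exact formatting of the values varies a bit by kernel version:

    cd /sys/devices/virtual/workqueue/writeback

    # Current affinity scope; possible values are cpu, smt, cache, numa, system
    # (plus "default", which currently resolves to cache).
    cat affinity_scope

    # Widen the scope so the work may migrate across a whole NUMA node:
    echo numa | sudo tee affinity_scope

    # 0 = the scope is a soft preference, 1 = never leave the scope even when busy.
    cat affinity_strict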
Something I haven't said: when we looked at the list of unbound workqueues associated with pool 16, there were about 15 of them, but there are only a few workqueues in sysfs, and that's because only the workqueues that are created with the flag WQ_SYSFS, I think, appear in sysfs. So if a particular workqueue is giving you heartburn, one of the things you could do is make a tiny kernel patch and just turn on that flag; then it'll be there in sysfs and you can change it. By default, any workqueues created by drivers will just be run by the bound kworkers, and there's another flag, WQ_UNBOUND, that assigns them to the unbound ones. One question.

So a lot of this is stuff I haven't played with; is there a particular book or blog series or video series you recommend to get more familiar with workqueues? Yes, the question is what the best resources are for workqueues. As I said, I put some things that I found useful in here; these are all hyperlinks if you want to read about these features. I have to say that the documentation for workqueues is excellent, because people like workqueues. The manual for softirqs is still missing, unfortunately, because people hate softirqs: they want to change them, they want to get rid of them, and no one wants to write about them. On this slide are three different tools for observing workqueues: wq_monitor, which is kind of top-like; wq_dump, which dumps out a huge amount of information (I've been piping it through grep because it produces pages of output and I only want one line; it's a bit slow); and then there's this new BPF/BCC tool, workqueue lat, which prints out the amount of time elapsed between a work item wanting to run and actually running, in the spirit of the whole family of BPF/BCC tools that report latency. You could learn a lot just by reading the kernel's in-tree documentation and using these tools; that gets you most of it. As I said, I have a bunch more demos that I don't have time to show, and you can download all of them and run them yourselves. All right, thank you, and thanks for a great talk.

Hi, I hope this isn't too off-topic, but in terms of other ways of dealing with real time: do you have any notion of what the disadvantages might be of, on an Arm system, using attached Cortex-M cores to do your real-time stuff? Are there disadvantages to that? Okay, so the question is whether there are advantages to using Cortex-M cores, which may be part of Arm SoCs. Arm systems-on-chip tend to have high-performance cores and lower-performance cores, sometimes in pairs, and so you can assign some work to the lower-capability Cortex-M cores. Certainly, using the Cortex-M cores to offload the kernel, say from handling network packets, or really any kind of workload that tends to thrash the kernel by generating a lot of interrupts, is something you could possibly assign to the Cortex-M. Well, the reason is to try to make it more hard real-time: you know, if you're trying to run a PID loop, running it on a time-sharing CPU can give you less-than-ideal results, because PID loops tend to assume that you're always measuring the system at a constant rate, and even the real-time Linux stuff isn't really hard real-time; it's more soft real-time. Yeah; so Bill is talking about PID, and he means proportional-integral-derivative servo loops, or tank circuits as we used to call them.
Yeah. And could you run a control loop on a Cortex-M? Sure; something like that, a very well-bounded task, would be perfect for a dumb device, sure. And basically my question is, are there any downsides to doing something like that? If it has to do a lot of communication with the other cores, then you incur a cost there. Yeah, okay.

Next question: are softirqs and workqueues interchangeable in a system, such that you can disable one and your system will be fine, or is the kernel in such a place that you have to keep them both enabled? So the question is whether you can do without softirqs or workqueues. They're both very integral to the operation of the kernel. If you look at the kernel's main function (in the init directory there's actually a main.c with a main function in it), it actually starts the workqueues and softirqs there, because the kernel really can't be without them, even ignoring user space and the actual work you want done. So they start really quite early after the system powers on. You can fiddle around with them a little bit, but you're kind of stuck with them; you just have to deal with them.

In the description of your talk you mentioned that you've also been using bpftrace; I was just wondering what sort of information you've been trying to view through that, and what your experiences with it are. Right, thanks for the question. I've also been playing a little bit with bpftrace to peer into the heart of workqueues and softirqs. With softirqs, what you can observe is really limited, and the reason is that they can run following the hardirq, in atomic context, in interrupt context, and if you know what interrupt context is, you can't be printing from interrupt handlers, and you can't put these probes inside interrupt handlers. That's why the softirqs are so inscrutable. But because workqueues run in process context, you can have all these nice things. So let's run one; there's one here, actually, yeah, let's run this one. Well, let's cat it first. I put some comments at the top, but here's the entire bpftrace program: it's going to print out this little header and then it's going to print a table. It's easier just to see it; this is the whole darn thing, so bpftrace is simple. All these tools tend to watch themselves: if I don't grep out the tty line discipline (ldisc) work, it's just going to tell you that it's printing to this console, which isn't very interesting. So once again, the C code is being JIT-compiled by LLVM and inserted into the virtual machine in the live kernel, and here is some output. Here is the CPU mask of some different workqueues which are running. Here's a memory-management per-CPU workqueue; here's a least-recently-used lru_add_drain per-CPU one, so it's cleaning out the page cache. This memory-management workqueue (by grepping for 8 we're looking only at the bound ones) is running some memory statistics. Then, if we look at the unbound ones, we see lots of "events" workqueues running; here are some associated with the Intel graphics; there are ordered and unordered workqueues, which is just yet another wrinkle; and here you can see the work they're performing, flushing the front buffer, so that's a blit, for people who follow graphics, and so forth. A console callback on the event; so once again, it's sort of observing itself here.
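If you want to try something similar without my script, a one-liner against the workqueue tracepoints gets you most of the way. This is a minimal sketch, assuming your kernel exposes the workqueue:workqueue_execute_start tracepoint (most do):

    # Count executed work items by work function, live, until Ctrl-C.
    sudo bpftrace -e '
    tracepoint:workqueue:workqueue_execute_start {
        @works[ksym(args->function)] = count();
    }'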
Anyhow, it's easy to cook up these bpftrace scripts, but drgn and these great in-tree Python tools in the kernel came along while I was writing this, and they're better, and they're maintained by Tejun Heo, who maintains workqueues. So there's kind of no need to roll your own anymore, I would say, even though it's easy to do. Just to follow up: would it be easy, for instance, to see what else was getting scheduled in the same worker pool that you were interested in, and then see if there was something crowding out the work function that you were interested in scheduling? Yes, if you play around with these different tools. So, wq_monitor, which is in the kernel tree, is the top-like tool; once again this is running drgn, and it's just running Tejun Heo's Python here. With some combination of these tools you really can ferret out everything, and you can always write little bpftrace scripts to watch the workqueue that you're really interested in. But this is now so easy compared to what it was when I started working on this talk; it's amazing. And also there's this automatic CPU-intensive work detection. Long story short, the splat that I showed you at the beginning might not even happen now, because the kernel's automatic detection of long-running work items has improved, and these observability tools didn't exist in kernel 5.15 when I had to investigate that bug. So this problem is just so much easier than it was 18 months ago; it's amazing. So this part of the talk is all good news.

So, are you aware of any tools, or any cases in the field, where people actually migrate work to new worker pools and groups, if you will, based on observed latency, sort of a scheduler that runs in user space and manages this stuff automatically? Or is that something that's just not being done yet because it would be kind of counterproductive? Yeah, I think a lot of this stuff is so new that people haven't really grokked it yet, in some sense; even keeping up with Tejun Heo's and his co-workers' most recent developments was problematic for the preparation of this talk. It's also a sign that there's great opportunity, because there probably haven't been that many presentations about all these features so far. But, like a lot of projects, my project isn't up to 6.9 yet, and when we get there I think there are real opportunities to use all these observability and configurability features to make things better, or, of course, worse. I mean, if you just say "oh, I want all my kworkers in separate pools, and I'm going to move everything out of ksoftirqd," you'll end up making a mess. What you really need is tests to characterize your workload and its performance, and there's no substitute for doing that work, no matter how many knobs the kernel gives you; more knobs just give you more rope to hang yourself with at the end of the day. Test, test, test. Okay. Cool.

That is working? All right. Hello, everyone. Thanks for coming to this talk. I've titled the talk "A SQL approach to exploring ELF objects." It's the start of an idea I've been pulling on, and I'm going to present some of what I've been working on. There's a lot more to it, but this is just the genesis. I think I've shown it to a couple of people, and they either think it's crazy, stupid, or awesome, so, you know, you'll have some opinion on it. But please suspend your expectations for a bit.
It's a pretty small group, so just ask me questions in the moment if you want; just raise your hand and I'll take a question, you don't have to wait until the end or anything like that. And if you'd like to reach me after the talk, I have a couple of different emails; that's one. I'm also on X, Twitter, and all that other stuff. Okay, just really quickly, who am I? That's a photo of me. I have three young boys; I find that important to share. I live in California. I'm a student at UCSC studying computer science. I also work at Google, where I'm doing nothing related to this talk at all. And I was also at SCALE for NixCon, which was here the last two days, so if you were part of NixCon: whoo. I could go on about that for quite a long time, but okay.

So I wanted to start the presentation with the journey that I'm on, and why I want to look at this problem, and I wanted to share a couple of insights that are driving my thinking. These are insights that are basically well known to us all as practitioners of software, but I think some of us may have forgotten, or never grasped, their scope. These insights are the fundamental motivation for my research, specifically for my PhD research, and I want to bring you along for the journey.

It basically starts with Eric Raymond, who published a paper in 1999 in which he described two philosophies for producing software: the bazaar model and the cathedral model. In the bazaar model, development is done kind of in the wild; in the cathedral model, it's done closed-source, monolithically. I want to make the case that the bazaar model has clearly won out against the cathedral model, and we've seen an explosion of packages and open-source libraries by all kinds of authors. On the left is a graphic of the number of installable packages available per Linux distribution as of October 2022; it's a bit old, but it drives the same point, and it's from the site Repology, which tracks multiple distributions. On the right is a different metric, specifically for Debian, which shows the explosion along a temporal dimension starting from 2004; you can see the rise in package submissions.

Basically, I just want to drive home the point that there's a lot of software out there. On the left, Nixpkgs has, I think right now, over 80,000 installable packages, and there are about 200,000 submissions for Debian. These are also packages where the authors have gone out and painstakingly packaged their software and put it up for installation, so there's probably another order of magnitude more software just out there. Conversely, another data point, published by GitHub, which has become the de facto place to put your open-source code, is that as of 2023 there are 300 million repositories out there, 50 million of which are public. So there's a proliferation of software.

This next figure is a garbled mess, and if you can't make sense of it, that's fine; that's the point. It's the build- and runtime-dependency graph for the Ruby interpreter. I generated this graph with Nix, which models exactly all the build dependencies you need. I just want to convey to you that software has become immensely complex, beyond our understanding, and it's hidden. I picked Ruby because it's actually a very innocuous program: it's available on every distribution.
It's written in C, it actually has very minimal dependencies, and yet that small dependency graph is in fact this. So this insight is complexity: in the last two slides I showed that the sheer quantity of software has exploded, and, maybe not surprisingly, as a result we're building software that, even when perceived to be simple, is incredibly complex and requires many dependencies. The final insight is that while all this software has gotten more complex and numerous, our ability to specify it, the specificity of it, has remained low. As a proxy, here's a graph of all packages in the Debian distribution and how they specify their dependencies. This is just something I wrote a script to grab; it parses out the version semantics from the package metadata.

Sorry, can you read the quantity? Yeah, I don't think that's a million; it's about 100,000. This probably includes libraries, okay. Yeah, so on this graph I have about a hundred thousand, while the previous graph showed much less than that for installable packages for Debian; I don't have a great answer for that right now. This must include libraries and things like that which don't count as installable packages for the other graph. Anyway, the point this drives across is that three quarters of all packages installed by Debian are unversioned. So the problem has become increasingly complex, with more dependencies, and they're in fact under-specified.

You might ask yourself, well then, how does this all work in practice? These packages work because, and really only because, the maintainers of Debian diligently and manually ensure that the full graph of packages in the distribution builds, links, and works together. It's an incredibly impressive feat, it requires an immense amount of knowledge about the needs of all these packages, and it's basically encoded within this single graph that you install when you use Debian. Interestingly, this cost is actually paid by every distribution out there. You as a user, though, are ultimately locked into a single monolithic package set; that's what makes it all possible.

So there are three big insights: diversity, complexity, and specificity. You can see from the previous slides that diversity and complexity are inherent to our modern software ecosystem and in tension with specificity. As practitioners we have to tame the diversity and complexity to do our jobs, but as the Debian package-versioning story shows, it's often too hard to pull off. If we had better tooling to track software metadata, specificity could help us tame that chaos. So that's the broad guiding theme of my research; let's see what I did with that.

Okay, where does this complexity come from? I sat and thought about that, and about at what point versioning is really necessary.
So, maybe unsurprisingly, I looked at shared libraries, which to me are the most fundamental data-management unit in Linux. Shared libraries effectively allow Linux programs to share code. They do this by assigning names to chunks of compiled code, like functions and global variables; these names are known as symbols. They were invented in the 1960s, when disk and bandwidth were at a premium, and back then there was a real benefit to only having to upgrade a single file, whether for network-bandwidth costs or, you know, quick security upgrades. Unfortunately, sharing code always creates problems and opportunities for weird disagreements. I've included here a nice quote from a co-author of mine, from a previous paper, which basically calls shared libraries the cause of, and solution to, all of life's problems. I mentioned all the nice, albeit outdated, attributes of shared objects, but what are these problems his quote is referring to?

I believe we've hit an inflection point, and there's now an opportunity to rethink and revisit many primitives present in today's stack. You don't have to read this quote; it's from Linus, and I just wanted to include it. I don't know if it's totally legible, but he's known for taking spicy takes on things, and this is just one of his many spicy takes, in which he himself calls out that shared libraries are basically a problem: they're unneeded, and they're causing way more problems than they solve. Specifically, in that thread, it was all about performance. So it's not a novel idea that I have; in fact, the originator of Linux shares the same idea.

As a practical example, let me ask you this question: have you ever had a problem installing a Python package? Specifically, have you ever seen this error? That "undefined symbol" message is one symptom of the problems shared libraries can introduce. Python is taking on the challenge of participating in a large, diverse, and complicated software ecosystem, most of which isn't written in Python. That's what makes Python useful, and extremely frustrating. I'm not here to solve that issue, and if you're interested in that topic, I hope you attended NixCon yesterday, because that could be one hopeful solution.

These kinds of problems make me think of this comic, which, although originally intended to poke fun at how some seemingly unknown open-source library is keeping the software world afloat, I think fits nicely into this framing as well. It's a great image; it fits into a ton of different framings. But hopefully, by this point, you all agree with my take: while software is getting more complex and expansive, it's also suffering from low fidelity, because it's hard to specify what we want, understand what's present, and work with the system. The majority of the tools that we've come to use to run and link software and to introspect it were devised in the 90s or even earlier. These toolchains are what associate objects with code, so we should really look to that space to figure out how to tame this chaos. I think there are a lot of opportunities there; it's rife for improvement because it's untouched.
We're all focused way higher up in the stack. Okay, so if we're going to solve problems with symbol resolution, again, the process of finding chunks of code to satisfy dependencies at runtime, then we'll need to take control over the sources of information about symbols and the code they refer to. That information is stored in a structure known as ELF, the Executable and Linkable Format. Shared libraries, relocatable objects, binaries: it's all ELF. And we're at the Southern California Linux Expo, so it won't surprise any of us to hear that the rest of this presentation is focused on ELF. If you happen to be on other, non-Linux platforms, I've found that if you squint hard enough, they're actually all the same and follow the same basic format; Mach-O and PE are eerily similar if you squint at them. ELF has dominated the Linux space since 1999, when it was chosen as the standard, and in fact, as of 2022, Linux kernel 5.18 dropped support for its predecessor, a.out.

The ELF object format solved a lot of real problems: we can specify varying ISAs, ABIs, and byte encodings. It was designed to be very configurable and adaptable, and to support enhancements without breaking compatibility. It's highly configurable, though over time it has become more dependent on convention. Basically, if you could see the figure, it's a series of containers. There are two types of containers, sections and segments, and they're referenced by a name or a pointer in a header table, so it's a very generic format, and then each section or segment has more meaning; but that meaning is convention, not canon in the standard. The format is comprised of either sections, which you'll have if you're working with relocatable objects, or segments, which hold the executable code, such as in a binary. Complicating matters further, ELF files can have both, and there's overlap between the two. The granularity and categorization of the data, and the naming of sections, is totally by convention: .text, .data, those are just names that different toolchains specified, and they became convention, which then effectively became law; .bss holds uninitialized data, for instance. Conversely, the execution view, represented through segments, is crafted with a lens toward performance: it's tightly packed executable code that can be mapped into the process address space and then run.

The ELF format was designed with trade-offs favoring performance and size. That's evident from the terseness of the file format: hex-editing it is very difficult and rife with problems, and, surprisingly, we do it often. The format also contains a lot of data structures for performance: it has hash tables, multiple types of hash tables, it has a Bloom filter, and some of the other sections have their own optimized encodings.
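To make the section/segment split concrete, here's how you'd peek at each view with stock binutils; /bin/ls is just a convenient example binary:

    # The ELF file header: class, byte order, machine (ISA), entry point, etc.
    readelf -h /bin/ls

    # The linking view: section headers (.text, .data, .bss, .dynsym, ...).
    readelf -S /bin/ls

    # The execution view: program headers, i.e. the segments the loader maps.
    readelf -l /bin/ls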
So what is the state of the art today if you want to look at your ELF objects? As software systems grow in complexity, you'd imagine the tools we rely on to debug and optimize them would keep pace. Oddly enough, the landscape of tooling for introspecting and working with ELF has seen a consistent reliance on traditional Unix tools such as readelf, depicted here, which basically just dumps a raw ASCII representation of the file. These tools have stood the test of time, and one can argue they adhere to the Linux/Unix philosophy of doing one thing well, but it leads to cumbersome commands to extract meaningful answers to the questions one may have about the data. The status quo for investigating binaries' data has basically gone unchallenged. In an era where data is growing at an unprecedented scale, and software has burgeoned in size and complexity, there is an impending need to reevaluate our toolset.

Performing more advanced analysis on ELF at the moment requires software authors to familiarize themselves with the ELF format, because they need to link in something like libelf and read it directly. Each tool that seeks to ask nuanced questions about the ELF format has to do so on its own, leveraging at most a library that helps parse the file format. Anecdotally, I've found that most software that needs data from ELF chooses basically just to shell out to readelf instead: you'll have some software that just spawns a subprocess, runs readelf, and greps out the information it needs. There also remains no comprehensive library that supports modification of ELF files. There are targeted tools, such as patchelf, that can change the interpreter and a few other things, but they have pretty restricted editing capabilities, and that tool has seen its fair share of bugs from keeping the edits to the file consistent. A common meme among Unix programmers is that shell scripting, specifically awk, can solve anything, and there's a lot of truth to this; in fact, Brian Kernighan's book The AWK Programming Language had an implementation of a relational database, an assembler, an interpreter for a toy computer, a graph-drawing language, and a recursive-descent parser.

Okay, so I had to do a lot of work with some ELF files, and that was me: I was like, god, I'm really sick of readelf and objdump. While those tools have stood the test of time and I could get the information I needed, it felt like a mismatched pairing between the techniques that are best practice in the software industry and what we're actually doing today. So that questioning led me to explore the union of traditional data-management principles and binary formats. SQL is the declarative lingua franca for databases, and its value proposition is pretty well understood: it decouples understanding how the data is laid out from asking for what you want, so you can just declaratively ask; there's an engine that processes the query and gives you the information you seek; and it leaves the database free to apply optimization techniques when appropriate.

Oh, yeah, this: everyone's mentioning how they have to include AI in their talks, so this was me asking some LLM to generate a logo depicting the Linux penguin and SQLite. I had to ask it many different times, and this one was pretty good, actually. So, go AI, I guess. Okay, so to test my hypothesis, I asked: can I model the schema of ELF in a relational model?
Okay, so to test my hypothesis I asked: can I model the schema of ELF in a relational model? This is the most basic schema I could come up with to start porting an ELF file to the relational model. I mentioned earlier that ELF is effectively a series of contiguous code or data blocks, and that's what I have here — nothing too crazy, and also not very helpful yet, but you've got your header, you've got a content blob, and another content blob, and many of them. That's what ELF is and roughly what it gives you; those contents then take on more meaning depending on their names, and there's a lot of specialization there. But as a starting point, the two are isomorphic.

I'll explain what this next slide is — you don't have to read it, it's not super legible, but I wanted to show that this is oddly an unexplored space. This is a Google search where I just typed the terms "SQL" and "ELF". There's some generated content, a couple of blog posts, and a repo that I'll get to in a second — those are the only top hits. There are a couple more results, but they're actually about Microsoft SQL Server being ported to Linux, so they talk about SQL Server shipping as an ELF binary; that's why they matched. Other than that, nothing. When nothing comes up for an idea you're pursuing, you're either onto something very wrong or very right, and it kept me thinking. Just to cover my bases, I also asked ChatGPT 4 — the paid model — whether it knew of a good tool that combines SQL and ELF. It came up with nothing and at best kept recommending readelf and objdump. So AI doesn't know more about this than the rest of us; for now our jobs are safe.

Okay, so this is the repo. I'll talk about it a little more, but I started exploring this in a utility called sqlelf. It's on GitHub under an MIT license, so it's open source. It has a much more complete relational schema for ELF, which is on the next slide, and it lets you query ELF objects through SQL. It's pip-installable at the moment. It's written in Python, which is both a good thing and a bad thing, because it's a little slow for very, very large binaries — but those live at the tail of the distribution and I rarely see them. It only supports reading at the moment; I have a plan to think through updates. And shout out to my buddy Mark right there, who helped contribute to the project.

The tool offers a schema that's pretty extensive — it won't fit on one slide, so I've shown just part of it. It's backed by SQLite. The schema here is a partial representation, like I said; some of the notable missing parts are support for relocations, and DWARF itself, which is a whole other huge mapping of tables that comes in when you build with debug symbols on. Right now I did a fairly one-to-one mapping of many of the sections into tables, with some minor abstractions to make things easier when you're writing SQL — which I actually find to be a benefit of SQL: if a particular section is difficult to introspect, SQL lets you create a logical view, build the better abstraction you want to work with, and jump off from there.
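As a rough illustration of that most basic relational model — one header row plus many content blobs — here is a hedged sketch using Python's built-in sqlite3 module. The table and column names are my own invention for illustration, not sqlelf's actual schema.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per ELF file: the fixed file header.
        CREATE TABLE elf_header (
            path      TEXT PRIMARY KEY,
            machine   INTEGER,
            type      INTEGER
        );
        -- Many rows per file: the generic containers (sections or segments).
        CREATE TABLE elf_content (
            path      TEXT,
            name      TEXT,      -- e.g. '.text', '.data' (names are convention only)
            offset    INTEGER,
            size      INTEGER,
            data      BLOB
        );
    """)

The real schema specializes these generic blobs into per-section tables and views, but this is the isomorphism the talk starts from.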
One abstraction I found myself needing often was the string table: I wanted to know the list of strings in my ELF file, so I made a view that holds all of the strings present. It's been a pretty fruitful schema, and it's still evolving — I keep pulling in more as I discover queries I want to run often.

How legible is that? I'll walk through the code so you don't have to read it closely; these are just examples, and I'll talk through some interesting ones. The rest of this talk is five or six anecdotes about ways I've been using the tool, to appeal to your senses for why you might want to use it too.

At the top is a relatively simple query to assess whether any symbols in the whole dependency closure of the application have been shadowed — or "interposed," which is the word used in the compiler world. A shadowed symbol is one exported by two libraries. Most often this is a bug, unless you're doing LD_PRELOAD, where you're being very explicit about it. If you've loaded two libraries and they export the same symbol, the dynamic linker is going to pick one, and the one it picks probably isn't the one you intended. Distributions check for this: they have a single package set and need to make sure no symbols interfere. You could be library A and library B, know nothing about each other's code, happen to pick the same symbol name, and later someone decides to link against both — they're going to have a bad time. So maintainers manually audit this, and perhaps unsurprisingly they write awk scripts to comb through it and test it. I was able to write it as a SQL query; at the moment it's checking a list of binaries, but as I'll show later, you can run it against a whole distribution. You encode the question you want as a declarative query and just check the result.

On the bottom left is a SQL query from work. Where I work, I tried replacing part of the auditwheel package with sqlelf. Python's packaging story is kind of crazy; auditwheel is part of Python's solution to the problem of how to distribute compiled code from one machine to another. There's an RFC: the community audited a base set of libraries and symbols present on a commonly used Linux system and deemed those dependencies to always be present. Any other dependency must be included in the Python distribution format itself, which they call a wheel. By not distributing every dependency and relying on some symbols being present on the machine, they can deem your package to work on most Linuxes — in Python that's called manylinux, and you get a fancy tag. In any case, part of the code I was looking at needed to understand whether a shared object was a CPython extension, and whether it was built for Python 2 or 3. They had a pretty expansive code block that used Python's equivalent of reading ELF, walked the symbol table, and checked whether a particular symbol matched one pattern or another, just to answer: is this a CPython 2 extension or a CPython 3 extension? I thought it was nice that I could rewrite that whole thing as a pretty terse SQL statement that I can run on my command line.
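Here is a hedged sketch of the kind of shadowed-symbol check described above, written against a memoized SQLite database; the table and column names (elf_symbols with path, name, exported) and the database filename are my assumed stand-ins, not necessarily sqlelf's real schema.

    import sqlite3

    conn = sqlite3.connect("dependency_closure.sqlite")   # assumed memoized database
    shadowed = conn.execute("""
        -- Symbols exported by more than one shared object in the closure:
        -- candidates for accidental interposition.
        SELECT name, COUNT(DISTINCT path) AS exporters
        FROM elf_symbols
        WHERE exported = 1
        GROUP BY name
        HAVING COUNT(DISTINCT path) > 1
        ORDER BY exporters DESC;
    """).fetchall()

    for name, exporters in shadowed:
        print(f"{name} is exported by {exporters} libraries")

The point is that the audit lives in one declarative statement rather than in a bespoke script.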
There's also a Python API, so I made the change in auditwheel itself, and I found the result more legible. On the bottom right is a use case offered to me by one of my colleagues at Google. They had some tooling that broke when an assumed invariant — that no two symbols can occupy the same address within an ELF file — turned out not to hold. They had to write custom code to check this: they hit the problem once and then had to write a custom binary to check all of their code. With a declarative language this becomes a bit more meta: when a new issue comes up, you just write the query you want, add it to your checks, and run it over your binaries. The equivalent query is shown there.

One final example is a quick way to find the instructions associated with a given symbol. I added an instruction table for fun. Because SQL is lazy about which columns you ask for, I can even include disassembly — I do show the disassembly as well — so you get output similar to what objdump and readelf give you, and you only pay that cost when you actually select that column. On the left I use the symbol table itself, and on the right, as a demonstration of the tool beginning to use the DWARF tables, I'm looking at functions in DWARF and calculating their sizes.

Okay, so this is me showing that sqlelf mostly works — oh, sorry, go ahead; I didn't see you. [Audience question about cost.] Right, that's one mode; I'll get to it in a second. Querying the file directly can be potentially expensive, so you may want to memoize the result. The tool has a mode where it effectively dumps everything out to a database, and then it's much faster — I have a scrappy benchmark later that shows it's pretty fast. By default it uses SQLite's virtual table functionality, which is a virtual schema, so you can lay the relational schema on top of ELF files and access them in place.

Cool, so SQL doesn't have to work on one file. readelf works on one file at a time, and that's a real limitation. The relational model is all about tuples, so now we can think about the data in those terms. The schema easily handles multiple binaries because I added an extra column for the path the binary came from. That lets you do aggregate analysis: readelf is "let me read this one particular file," whereas here you can do analysis across your whole system, because that's what the relational model gives us. Maybe I want to know the most-used symbol among a set of ELF files — say I want to think about dead-code elimination: "wow, in my distribution nobody ever calls this library, maybe I should think about removing it; nobody uses these symbols." You can do this kind of analysis after the fact, ad hoc: what's the most-used shared object, and so forth. I also had an insight about ar, the archive format: it just packages a bunch of object files when you compile, and there's a lot of overlap with what I'm doing — you can have a schema where you pack many different ELF objects into one database, which is exactly what the ar format does. I'll touch on that a little more later.
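A hedged sketch of the aggregate-analysis idea just described — "what is the most-used symbol across a set of binaries?" — again assuming a hypothetical elf_symbols table with path, name, and an imported flag, rather than sqlelf's exact schema.

    import sqlite3

    conn = sqlite3.connect("whole_system.sqlite")   # assumed memoized database
    most_used = conn.execute("""
        -- Rank undefined (imported) symbols by how many distinct binaries need them.
        SELECT name, COUNT(DISTINCT path) AS importers
        FROM elf_symbols
        WHERE imported = 1
        GROUP BY name
        ORDER BY importers DESC
        LIMIT 10;
    """).fetchall()

    for name, importers in most_used:
        print(f"{importers:6d}  {name}")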
[Audience] Yeah, go ahead. So the question is: in an embedded world, would this help me analyze which of my symbols are actually being used and figure out which ones to shut off or strip out? I'd say yes, because an ELF file has to declare which symbols it exports, and another ELF shared object or binary has to declare which ones it imports. So you could write an exhaustive SQL query asking: out of all these binaries, did anyone import this symbol? If not, then at least it doesn't need to be exported — you'd still have to check whether it's used within the object itself — but you could remove it from the export set, which really cuts down the surface of what people can link against, and that's the first step.

Yeah, go ahead. [Audience] Do you see this purely as an introspection tool, or do you see SQL as a way of exposing an alternative binary format that an interpreter could actually load? Okay, so the question — I'll repeat it anyway — is whether I see this purely as an introspection tool or as a replacement for ELF itself. I have dreams of the latter. I haven't implemented that yet, but it begs the question: do we need ELF? I'll talk a little about the performance angle, and if we can figure out that trade-off, then maybe we don't need it as much. That's where I was going with the ar point: there are parts of the toolchain that can collapse as soon as you're working in the relational model. Do you need ar? ar is effectively just a SQLite table with many binaries in it instead of one.

Okay, I think two more examples. On the top is the relocations table, and on the bottom are DWARF examples — I've covered that enough; the possibilities are kind of limitless, so I tried not to put too many, just the more interesting ones. On the bottom is an interesting bit of static analysis: I wanted to see line counts. DWARF has function symbols and maps them to source code lines, so you can ask what the largest functions are across a series of binaries, or within your binary. You can get that kind of static analysis in SQL after the fact, as long as you compiled with debug symbols on.

Okay, so how does this approach fare in practice for performance? Since the space of possible queries is effectively limitless and sqlelf allows arbitrarily complex queries, it was tough to build a fair benchmark against, say, readelf, but I chose a straightforward example as a proxy and compared: a single scan, just listing all the symbols, which is similar to what readelf does. On the left is a graph where sqlelf is much slower — it's the green line; the red line is readelf. But interestingly, you can memoize: I can take that virtual schema and dump it, persist it to disk. Then I can use the SQLite C client — I don't even need my sqlelf tool; I can use the whole SQLite ecosystem, which I'll get to in a second — to access the data, and maybe unsurprisingly, it's wicked fast. Databases are fast: there's the one-billion-row challenge, where people write crazily optimized code and still land at roughly what the database engine gives you — it's like racing a fighter jet.
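To make the strip-unused-exports answer above concrete, here is a hedged sketch of such an audit query; the elf_symbols table, its exported/imported columns, and the database filename are my assumed stand-ins for illustration, not necessarily the tool's real schema.

    import sqlite3

    conn = sqlite3.connect("firmware_image.sqlite")   # assumed memoized database
    dead_exports = conn.execute("""
        -- Exported symbols that no other binary in the set imports:
        -- candidates for removal from the export set.
        SELECT e.path, e.name
        FROM elf_symbols AS e
        WHERE e.exported = 1
          AND NOT EXISTS (
              SELECT 1 FROM elf_symbols AS i
              WHERE i.imported = 1
                AND i.name = e.name
                AND i.path != e.path
          );
    """).fetchall()

    for path, name in dead_exports:
        print(f"{path}: {name} is never imported elsewhere")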
And it was pretty easy to persist the data: you can use a CTAS — a CREATE TABLE ... AS SELECT — to effectively memoize it, which also helps me avoid the pitfalls of the tool being written in Python. So how problematic are ELF files with symbols on the order of a hundred thousand, which is where the graph really spikes for sqlelf? I did an audit of a distribution and found that only a very small fraction of files have over a hundred thousand symbols. So on real Linux distributions it works surprisingly well in practice. Where it fell over was when I tried applying it to some binaries produced at Google: those were huge, gigabytes in size, and it didn't work very well there. But for most of us it's going to be fine for now, and if needed we could rewrite it in a more performant language.

Okay, I built another tool layered on top of this: you give it a Docker image and it produces, for that distribution, the full final sqlelf metadata that you can explore. I ran it on individual images — Debian stable and Debian Buster; it's pretty quick — and the result is a single SQLite file that contains every binary from that Docker image. I don't show the command, but in SQLite you can attach multiple database files and they become, effectively, different namespaces; in that second query you can see the two different FROM clauses, Debian stable versus Debian Buster. The question I'm asking here — part of the challenge is coming up with interesting questions, that's step two — is: which glibc symbol versions aren't present in both distributions? That means if you built your software against some of those symbols at that version, it won't work across both Debian releases. I thought that was an interesting quick audit: as a library author, which distributions do I know my library will work on? This is one way of discovering it — the versions the distributions have in common are the intersection.

This is another tool, written by Simon Willison, called Datasette. You give it SQLite files and it gives you a really nice visualization to explore the data. This was me uploading one of those Debian databases to it. You should really check out the project; it's very interesting — it's meant for journalists to explore data and build websites. What I'm showcasing here is that SQLite is a huge ecosystem in itself, so we get to tap into it, instead of having — I was going to say archaic, but I'll say niche, and it's not even niche, because all of Linux is ELF — a segmented ecosystem. Coupling with SQLite gives you access to a ton more tools. And this was an interesting one too: I could just throw a website up — I deployed it with some cloud provider — and you can give it different SQLite files and instantly start visualizing and navigating with a UI, if writing SQL isn't your thing.
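A hedged sketch of the memoize-then-compare workflow just described: attach the two per-distribution databases and diff the symbol versions. As before, the table and column names (elf_symbols, version) and the filenames are my assumptions for illustration, not the tool's exact schema.

    import sqlite3

    conn = sqlite3.connect("debian_stable.sqlite")            # assumed memoized database
    conn.execute("ATTACH DATABASE 'debian_buster.sqlite' AS buster")

    # Symbol versions (e.g. 'GLIBC_2.34') present in stable but not in Buster.
    missing_in_buster = conn.execute("""
        SELECT DISTINCT version FROM main.elf_symbols
        WHERE version LIKE 'GLIBC_%'
        EXCEPT
        SELECT DISTINCT version FROM buster.elf_symbols
        WHERE version LIKE 'GLIBC_%';
    """).fetchall()

    print([v for (v,) in missing_in_buster])

Because the memoized file is ordinary SQLite, any SQLite client in any language — or a tool like Datasette — can run the same comparison.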
Someone asked a pretty poignant question earlier: where am I really going with this? I think the kernel of the idea — the seed — is that I keep finding myself jumping through hoops to replicate what ELF does, and SQLite kind of already does it well. Everything that needs to access ELF metadata — the loader, compilers, linkers — is doing it on its own, individually, in different ways, and trying to counteract the performance cost by adding its own optimized data structures. They're each building their own little engine to do something a database would do. So sqlelf, to me, is a little seed, and I think there's a bigger, broader idea out there to pursue.

All right — thank you. I'm happy to take a few more questions. Green shirt. Oh, is it osquery or one of the others? Yeah, I'm investigating that right now. So the question is: this looks like it has some good applications for malware analysis — have you looked into integrating with osquery? Definitely. There's a lot of synergy between osquery and this; osquery is also trying to put a SQL abstraction onto the rest of your system. What it could give you is a nice temporal analysis: sqlelf right now gives you an in-the-moment view of a static file, what's there, and I'm looking at osquery to see whether you could get this analysis over time — I think that's the dimension osquery adds. And the more you augment the schema, the more malware-detection analysis becomes possible, specifically with the instructions table; right now each row is just one instruction, a pretty simple dump.

[Audience] You said the different binary formats are kind of similar if you squint. Are they similar enough that you could abstract them in the SQL schema and present a similar interface for other operating systems, making it a general-purpose tool for everybody? Would that be interesting? Yeah, I think so. They all need symbols — I've looked a little at the other formats, specifically Mach-O and PE, and they all have code blocks, they all have to declare which symbols they need, where to find them, and which other shared libraries they depend on; that's pretty much it. Someone has already forked this code and added support: he added a few extra tables and Mach-O support. It's not unified — there are the ELF tables and then there are the Mach-O tables — but the setup lent itself to it; it was a very small fork. I definitely think some of the tables could be unified. Part of this is to let you explore ELF specifics, so you do want a specific schema for some of the ELF-specific sections, but yes.

[Audience] Along the same lines: have you looked at using Postgres rather than SQLite as the underlying data store and query engine? Yeah — I knew I wanted to head in the direction of maybe replacing ELF. I think the more interesting analogy there would be DBOS, which is trying to rebuild the operating system's data access layer on top of a database.
So there's more synergy there, but to start I had this breadcrumb of an idea: let me lay SQL on top of ELF, and SQLite is simply easier to work with — and it gives you a file. The analogy holds: you had a file before, and you still have a file; I can still execute it, and I've just given you SQL introspection on top. But yeah, thank you for the comment.

[Audience] This question is a bit out of my depth, but SQL has roots in Datalog, which has more expressive query syntax, especially recursive queries over datasets. Did you investigate implementing a query-ELF feature using Datalog instead of SQL, or was performance a concern? Sorry, what's Datalog? I'm not familiar with it. [Audience] It's a query language that predates SQL and was sort of its progenitor — it's related to Prolog. I just wanted to know if SQL was a constraint for some reason, or chosen because it's ubiquitous. [Speaker] It's more the ubiquity and the lingua-franca aspect. I want to make the tool available to as many practitioners as possible, and again, it's that ecosystem. I also chose SQLite because the ability to share a single file is really important, and you can then tap into all the tools that read it. I didn't cover it on this slide, but SQLite publishes that it's the most widely deployed software in the world — it's on every phone multiple times, and it's on your computer many times; they estimate something on the order of a trillion installations, which is pretty insane.

[Audience] I would assume there's a lot of opportunity for the on-disk format to be efficient for memory mapping and things like that. Have you been following Tvix, the Nix re-implementation, which stores binaries in a FUSE-backed store? I was thinking that could provide a native SQLite view even if the underlying storage were more optimized than a standard SQLite database. [Speaker] I've been following Tvix — the Nix re-implementation — though not that closely; mostly just waiting to see when it's done. And what was the first part of your question?
[Audience] It was more a comment, really: you could go beyond SQLite — the native storage format might not be optimal — toward a better on-disk representation, and then provide different views into binaries depending on whether they're being executed or being loaded for analysis. [Speaker] Yeah. I'm deliberately suspending the requirement for performance right now, which I know could be a limiting factor for many, but as a kind of research exercise I'm allowed to do that. I'm asking: what if we didn't always take the trade-off of picking performance, which always comes with the constraint of terseness and all that — what do we get out of it? Then, if there's a benefit, try to add performance back, or decide there are cases where the performance requirement can be relaxed. I have thought through a few ways to add performance back; you could, for example, keep an index while still keeping the memory regions contiguous and mappable, and still have those sections to explore.

Did you have a question? I'll take it — maybe it's the last one. [Audience] To return to the original problem statement: people shell out to readelf because parsing ELF is painful, but you're now shelling out to sqlelf. Even though you don't love shared-object libraries, have you thought of publishing an API so that people could reuse this approach without writing yet another shell script that calls sqlelf? [Speaker] Totally. It is a Python library — it's an application as well as a Python library — so if your code is written in Python today, you can use it in your codebase fairly trivially. If you're not in Python, you can, at the moment, do the memoization step into a SQLite database, and every language has a SQLite client you can use to access the data that way. So there are definitely good integrations possible without shelling out. But what I like is that you can always take that query block — you know how it is when something's not working and you don't want to mentally translate it into another language — and copy it onto your terminal, continue your investigation there, and then go back. So there's a bit of a trade-off, but integrations with other languages are very possible right now. All right, thank you everyone.

Right. Whoa, that's cool. Ah — hi everyone, thank you for coming to my talk today. I'm Sweet Tea. I work at Meta, where I've spent the past two years primarily on the btrfs file system; before that I was at Red Hat. I really enjoy working on the Linux kernel, and today I want to tell you all about encrypted btrfs subvolumes, which is a project I've spent a good bit of my time at Meta on. It's one of the most exciting new features in btrfs, in my humble opinion, in several years, and I'm really excited that all of you felt like coming in to hear about it. As an overview: first I'll talk a little about btrfs and where encrypted btrfs subvolumes actually stand, then I'll cover the technical aspects of the work, the differences between file-system-level and built-in disk encryption, future goals for the work, and finally practical usage of encrypted subvolumes.
Oh, so yeah — with that, I'm curious: how many of y'all use btrfs? About two-thirds, very cool. And how many of y'all have contributed to the Linux kernel? About one-third, very cool, awesome.

So, a general overview of btrfs. Btrfs is, in my humble opinion, an advanced file system. That's not necessarily a judgment on quality, just on the number of features it has. A normal file system — you've got a disk, you use it, and it provides files and directories: glorious, worked well ten years ago, still works well for most use cases. Advanced file systems have a couple of extra features that can lead to data efficiency or isolation that older file systems don't have. One of these features is called reflinking, or reference linking: you take a chunk of data and tell the file system that it's actually part of two different files, which lets you deduplicate your data. Subvolumes are another great feature. Instead of one disk that you mount and that provides your whole file system, with subvolumes you can choose which directory tree to mount at any given point. For instance, every user on your machine can have a different user subvolume containing their personal data and only their personal data, so that even without file permissions, one user can't read any other user's data — that other user's data isn't mounted at all; it's just completely unavailable. Along with subvolumes, snapshots are a very cool feature: they capture how the files in a subvolume appear at a given moment in time. This is really great for backups — you can take a snapshot of an exact point in time and save it elsewhere, so you can easily roll back a package install or a change in your code that you turn out not to have wanted after all. Btrfs also features checksumming of your data. Disks are glorious and disks are very reliable, but at Meta scale, sometimes disks go bad, so we have checksumming built into btrfs to notice when data on disk has gone bad. It's also useful on your personal computer, so you know when to dig out the backups.

So overall, I would call btrfs, XFS, and bcachefs the advanced file systems in the Linux kernel — it's possible I just don't know how wonderful and advanced the other file systems are. XFS doesn't have subvolumes and snapshots, but it has reflinking, which is pretty cool. Btrfs has all of these, and my understanding is that bcachefs has all of them too. However, for a long time btrfs has not had a good encryption story. We've had block-level encryption using device mapper — dm-crypt, which is the underlying technology behind LUKS — and LUKS is pretty standard on most distributions these days.
However, there are drawbacks. When you run btrfs on top of a LUKS device it works great, but it encrypts everything, and because btrfs copies data every time you write, there can be a bit of extra IO, which can slow you down a little. And the way encryption for file systems is usually done in the Linux kernel would have meant giving up reflinking and checksumming. So, as of a couple of years ago, we started working on per-subvolume encryption, where each subvolume has its own encryption key — which also works particularly well for per-user subvolumes: every user can have their own key. It's pretty great.

Now, this is the part where I awkwardly admit that when I submitted this talk, it seemed like this work was about to go into the Linux kernel and would be widely available and easy to use everywhere by the time I gave the talk six months later. However — those of you who've worked with the Linux kernel before know that it can sometimes be a little difficult to come to a joint understanding and get your code reviewed and merged. So this is the part where I awkwardly say: this stuff isn't actually in the kernel yet. It's a patch set on a mailing list and a branch on GitHub, and it's not actually easy to use yet. But "one more kernel release" is what I've been saying for the past six months.

In any case, there were some technical challenges in adding encryption to btrfs that I think are particularly interesting and illustrate some of the complexities of btrfs compared to other file systems that don't have those advanced features. In the Linux kernel there's a sort of library called fscrypt that provides encryption facilities to file systems: there's an encryption area of the kernel, and fscrypt is the glue that connects your file system with all the different encryption options available. It makes them easy to use. It's commonly used on Android, as I understand it, so probably at least half of you are currently using fscrypt in your pockets. ext4, f2fs, and ubifs have integrated fscrypt, so they all support encryption through this interface. But fscrypt has been around for a while, and it's oriented at file systems without those advanced features: there is one key per directory tree, and you can't have multiple keys within a directory tree. This is a little awkward for btrfs, given that you might want subvolumes inside of subvolumes inside of subvolumes — stack them up, nest them all over — but fscrypt doesn't currently support that. Fscrypt does allow you to delete files without actually having the key. This is a really important feature, because otherwise you couldn't get your space back unless you reformatted your system — and I don't know about you, but I don't like reformatting my systems. Anybody actually like reformatting? Of course, there are always a few people in the world — keep calm and reinstall.

So the big advantage of using fscrypt instead of LUKS disk encryption is that fscrypt lets you encrypt only your personal data. There are different threat models you could consider for your data, right?
There's the threat model where everything about your data is important — even just the size of a file could give away some information about it — and that's a perfectly valid model to use. In that universe it's a good idea to use full-disk encryption, so you don't leak any information about your files at all. However, for me, and for some other folks in the world — perhaps even a lot of folks — we really only care about our data being read. We don't really care about other people knowing how big our files are; there can be a lot of files of any given size. As long as you can't read the data or the file name, what's the harm? So fscrypt file systems encrypt just file names and data, which reduces the amount of material that needs encryption, which means you spend less CPU on encryption.

I don't know how many of y'all have worked in the Linux kernel enough to know about inodes — how many of you have heard of inodes? Okay, about three-quarters, very cool. So the idea in fscrypt is that the in-kernel inode structure embeds an fscrypt information structure that also gets stored on disk. That information structure allows the fscrypt library to encrypt data in files based on where that data sits in the file and which file it's in. You can probably already see where this is going and why it would have forced us to give up reflinking: if data is shared by two different files and it's encrypted under this scheme, then from one side you'd be trying to decrypt it with this file's info and from the other side with that file's info, which is different. Without changes to fscrypt, we couldn't reflink encrypted data together, which would be a bit awkward. Also, as I mentioned, btrfs likes to use nested subvolumes, which makes it important to support different keys at different points in the directory tree. And finally, it's difficult to do checksumming if the file system can never actually access the encrypted data, which is the direction fscrypt is moving in. So this is a bit awkward and requires changes to fscrypt — and figuring out exactly what those changes should be has been a bit of a project.

The overall vision for btrfs encryption is that instead of encrypting data based on what file it's in, you encrypt based on what extent it's in. If you've worked with file systems, you may know that they usually store a file's data in actual extents on disk, which serve as an indirection layer that allows moving data around or packing it more efficiently. The idea we came up with is to store the fscrypt information alongside those extents instead of with the inodes. That way, if two files point at the same extent via reflink, both can successfully decrypt the data, because the additional information that gets combined with the key in order to decrypt is stored right there with the extent. This does mean it takes more metadata space than in older file systems, but it means we can reflink — so by and large, for btrfs, it's a win to do it this way.
We sent up the first version of these patches, with one design, all the way back in October of '21. We were using a minimal set of data stored with each extent, and it was pretty decent, but it had some security risks, and so, as it should have been, it was rejected. I worked on this a lot, and eventually we came to an understanding that the security holes were too bad and it was a bad idea. In November of '22 we sent out a second design that was largely my work, trying to pull the inode information directly down onto the disk. It was elegant in some respects and inelegant in others, and the maintainer decided that this approach wasn't working either. So in September of '23 — about when the call for proposals for this conference was due — we came up with a third design: my team lead, Josef Bacik, and the maintainer worked together on a design that everybody agreed on, my patches were amended greatly, and new versions of the patch set were sent upstream. Unfortunately, it means we can't have nested encrypted subvolumes right now, but we can one day in the future; the goal was to get it into the kernel sooner and add that feature later. And unfortunately, even though it has seemed one release away for the past six months, it still hasn't gone upstream. I'm still really excited about it and still confident it will happen in the near future — but it hasn't happened yet.

So the idea here is that we're storing information both with the inode and also down with the extent. The inode stores which key we're using, so if two inodes want to reflink together, they have to use the same key. This provides some security advantages, so that two subvolumes that happen to contain the same data don't end up leaking information if only one of them is mounted. It has its merits, though it does somewhat restrict which encryption options you can use. We're up to version 5, coming out soon on the mailing list, and it still hasn't landed, but hopefully it will eventually. This design addresses all of the difficulties I talked about. It doesn't allow changing keys yet, and you might conceive of wanting to: say you have a subvolume that serves as the base for all of your user home directories — you might want to snapshot it and then give each user their own key, and either re-encrypt or use the old key to access the common data shared across the user directories. We don't have support for that yet, but it should be fairly easy to add. We also came up with a way to checksum the encrypted data, which is very important for btrfs, so you can tell if your data got corrupted on disk. And by using all three elements to encrypt instead of just two, it allows reflinking — and will eventually allow reflinking between any number of inodes. So it has a lot of potential, and I'm really excited for it to go upstream soon. There's still some work to be done here.

Bcachefs is the wave of the future, according to some people. How many of y'all have played with bcachefs? About a quarter, very cool. Bcachefs has some great features, and I've really enjoyed playing with it.
I really enjoy there being more file systems with subvolume features in the kernel, so that we can build common interfaces and spread the love of subvolumes wider. But bcachefs doesn't actually use fscrypt right now. It also has only one encryption key per file system, and everything is encrypted — you can't delete stuff if you lose the key, which is a little awkward. However, it has some features that fscrypt doesn't yet have. Fscrypt does not currently allow authenticated encryption. Authenticated encryption is sort of like a more advanced form of checksumming: you take your data, you take a key, and you get out your encrypted data plus an authentication tag that can only be derived if you have the correct key. So it leaks absolutely no information — whereas a checksum can either be spoofed, if it's a checksum of the encrypted data, or let someone figure out what the data is, if it's a checksum of the unencrypted data. Authenticated encryption neatly solves these problems by only allowing you to verify the data's correctness if you already have the encryption key. So this is a very cool feature, and I think bcachefs is doing wonderful work in using it, but it does mean there are fewer options for the encryption algorithm — and different companies and different countries have different requirements for encryption algorithms. So there are trade-offs between bcachefs and btrfs.

Another option that you've probably heard of, which I talked a little about, is LUKS. LUKS does allow authenticated encryption, if you also use the dm-integrity target. It still only allows one encryption key per file system or block device, and obviously everything is encrypted, which again means you can't delete stuff if you don't have the key. But it does allow authenticated encryption, as I mentioned, and it allows you to change the encryption key, which is a really cool feature. It means that if your key is compromised, you can declare it no longer valid without losing all of your data, and you can re-encrypt with a newer, fancier encryption algorithm if your existing algorithm turns out to be insecure. Now, I believe LUKS is — yes — I believe the technology is, well, technologying. So, technology is pretty cool. Very exciting.

So, how many of y'all have had an exciting experience with the displays on your computers? Cool. And how many of you have deleted something that you didn't want to delete? And how many of you have reformatted a disk that you didn't mean to reformat? Good times, good times. Have any of y'all actually used the LUKS option to re-encrypt your data? Very cool — that's a lot more than I expected. And how often have you changed your keys? Yeah, it is a bit of a pain, right: you either need to keep both keys around, or you need to spend the CPU time and the IO time to read all your data and re-encrypt it. So it is a bit of a pain. Technology. Different technology — so it goes.

Yes? [Audience question] So the question is: does this have anything along the lines of access control lists? My understanding is that ACLs are more or less a distinct concern. This work revolves around having an entire file system directory tree that is only visible to a user if they successfully load their key. Once they load their key, however, the entire set of file system features is available — so if you've got ACLs stored as you normally would with the file system, those would be available now that you've decrypted it.
Yes? [Audience question] So the scenario being described is basically: some number of users — say three — where every combination of two of them wants to access different files. There are some files that only A and B should be able to see, some that only B and C should see, and some that only A and C should see. I see — so the idea is that pharmaceutical companies sometimes collaborate and sometimes don't want to collaborate, and we don't want to leak information between companies that aren't collaborating. Indeed, indeed. Unfortunately, I haven't thought about this very much. You would need eight partitions with LUKS; with a btrfs-based approach, you could still have eight subvolumes, one for every combination. And that is a good point: while I don't have a better solution than a subvolume for every set that wants to collaborate, with a LUKS-based partitioning scheme you have to decide how much space to allocate to each partition at formatting time, whereas with btrfs subvolumes everything is stored in the same set of backing disks. It's a lot more flexible: if you delete some data from one shared subvolume, all the other subvolumes can use the newly freed space. Also, my understanding is that there's some technology to take one single encryption key and split it into many encryption keys — I'm not a cryptographer, so don't ask me how — but such a technology could be layered on top of btrfs subvolumes to provide exactly the sort of shared access control you're describing: one key unlocks all the subvolumes you're supposed to have access to, each with its own different key, and that, I believe, allows different subsets of users to access each different subvolume. Yep — it's true that right now you can use normal file system partitions, and that's a reasonable solution in the current world, but encrypted subvolumes could definitely be used in this fashion too.

So, it looks like my technology is properly technologying now, so I'll charge forward with the rest of my slides. Yeah — we were talking about LUKS: it's great, it encrypts everything. Sometimes you don't need everything encrypted; sometimes you care about performance more than you care about encrypting everything. That's why fscrypt might be the better choice for some applications and LUKS for others. So we talked about those two options, and how key changing is a very cool feature that I definitely hope to integrate into btrfs encryption one day. And there's a Fedora proposal to maybe use btrfs encryption one day.
The idea is that some laptop OEMs have requested a way to load a common encrypted image onto the disk of a laptop they're shipping to a company, so that everything is encrypted from the start. The company can then set a new key for the packages it wants to install and for its home directory templates, and then the user can set their own new key for their home directory — possibly with some sort of escrow so that both the company and the user can read the user's data, because users sometimes lose their keys and really didn't mean to. It's a pretty bad outcome if a user loses their key and, with it, the past three months of work stored on their computer. So this directly requires key changing: you start with an encrypted image, you probably want to change the key on that image on a per-company basis, then the company wants to change it so the OEM can't access it, then the user wants to change their key so that nobody else from the same company can access their data without the full company key, and so on.

The only bit of Meta-specific information in this talk is that we once had a vision of doing something similar — and we may want to do it one day in the future — where we install an unencrypted package into a subvolume. We usually run services inside a container inside a subvolume for isolation purposes. If we install an unencrypted set of binaries and then change the key on the subvolume, everything written by that service to disk is encrypted, with no chance of somebody forgetting to call the encryption service before writing to disk. We've decided to go with a different solution for now, but btrfs encryption is a feature with wide applicability, and we may find an application for it in the future or change technologies. And finally, there are some companies that simply believe in changing your password every so often, and this would allow you to change the password or key on your subvolume — which can't currently be done with the btrfs encryption proposal as it stands now.

So that's the technical overview of how this is all put together and why we care about it. Now we come to the part where I really wanted to have a demo, but I'm running macOS and my virtual machines broke with an update this morning, so that's a bit awkward. It's what I get for using a Mac for kernel development, right? Also true. If you download the patch set off the mailing list, build the kernel, and interact with the keys directly with the fscrypt tooling, you can set up encrypted subvolumes right now. Obviously it's still early days, so if you're an early adopter, we appreciate your efforts and hope you don't lose any data. There is definitely a stabilization period — while we don't know of any bugs, everybody knows that software without bugs doesn't exist.
So, yeah. Once it lands upstream in the kernel, it should just be a matter of installing the new kernel, installing the fscrypt tool that interacts with file systems, and making a very small change to your systemd services so they ask for a btrfs key instead of just a LUKS key. That will load the key into the kernel keyring, and then btrfs will unlock the root subvolume, assuming it's encrypted with that key. It should then be easy to integrate with systemd-homed or some other login service to unlock your home directory as part of logging in. So once this actually lands in the kernel, very few user-space changes are needed to use it, and I definitely hope that one day all of the Fedora users in the room will be able to pick between LUKS and btrfs encryption when they install — or perhaps bcachefs encryption; they all have their merits.

So, yeah — how many of y'all actually use Fedora? Okay, a lot of Debian and Ubuntu users, a fair number. SUSE. Now I'm really curious what the rest of you use. Of course — of course, NixOS is also an excellent choice; I really like Nix, so I don't know why I didn't think of it. Cool. So, unfortunately, that wraps up my talk about upcoming, real-soon-now features in btrfs. Thank you. We have a bit of time for questions — I see a question back there.

[Audience] My understanding is that there's a big limitation with fscrypt-type approaches, in that the metadata remains unencrypted. Has any thought been given to adding something like composefs or erofs to hold the metadata, so you can have both encrypted metadata and per-file encryption? [Speaker] Oh — so you're asking about the possibility of encrypting metadata, sort of like bcachefs does. That's quite possible in general, but for btrfs specifically it's very difficult, because we store all of the file system's metadata in a single metadata tree, so it's not obvious which key you would use to encrypt each block of data in that tree. If you tried to encrypt each individual file metadata item, those items are so small that there could be too much risk of brute-force attacks on your key, I believe. I could imagine a world where you used an overlayfs-type approach and stored all the changes to your subvolume in a separate encrypted file on btrfs, and I suppose that could work. So that's a good idea — it should be easy to integrate on top of the btrfs subvolume encryption support once it lands.

[Audience] Yes — you mentioned there was a restriction on having nested subvolumes with different keys. That doesn't prevent me from having two subvolumes and just mounting them in a parent-directory relationship in a regular file system hierarchy, right? So what are the implications of the restriction — what does it prevent us from being able to do? [Speaker] Right. For nested subvolumes, the idea is that you have a directory tree, and at some point in that tree you have another subvolume that looks like a directory. With fscrypt right now, you can perfectly well have two different directory trees that don't overlap at all, each with its own encryption key.
However, if you had your one directory tree and then tried to access the nested subvolume that looks like a normal directory, the keys would mismatch and you would get an error. You'd have to mount them separately in different parts of your file system hierarchy. So hopefully we can loosen that restriction at some point — it shouldn't be very technically difficult; we just haven't done it yet. Question behind you?

[Audience] My concern is ransomware — I missed a little bit of the talk at the beginning. We were hit by ransomware, and the company I work for paid for it, so everybody went crazy and it's been a crazy year securing everything. I understand this is done at the volume or subvolume level — I'm not an expert on btrfs, I run everything on ZFS, but all my servers run btrfs. The question is whether the application itself can be the only entity able to decrypt the data, not the user — so that if we get hacked, the attacker can't encrypt my disk. Or I guess if he has access he can do anything he wants — but in the ransomware scenario, how can this help? [Speaker] I unfortunately don't think I follow your question in enough detail to figure out a solution right now — let's connect offline, sorry. Exactly, yeah — I see your point; it's very well taken: there is no way. [Audience] Can group-owned volumes be opened and unlocked by multiple users? [Speaker] Yes — if you use some sort of key distribution or key derivation scheme so that multiple users all have the same key, they can all unlock a shared subvolume using that key. I think my answer there is a derivation of your question — so thank you very much. Are there more questions?

So, thank you all for coming to my talk. I'm sorry I don't have actual packages in the Fedora repositories to tell you to install right now to get encryption, but I'm excited for that day to come real soon now, and I hope all of you try out btrfs encryption one day. Thank you for learning about it today.

We're 20 minutes early, so — yeah. Can folks hear me? All right, we'll start in another minute or so. All right, I think I'm going to get started. Welcome, folks, to my talk — I appreciate you all being here; pretty nice audience for a cold and windowless room to present in, so thanks for coming. I'm Dan Schatzberg. I work at Meta on the containers team, and I also do a lot of work on resource isolation and resource efficiency. The topic today is sched_ext, the extensible scheduler class — a way to write schedulers for Linux. I'm going to get into background and motivation, but before I start: how many people have written BPF? How many people have written their own scheduler or edited the Linux kernel scheduler? A smaller set of people. I'm going to cover both topics and how they combine to make it much easier to change the Linux scheduler, which is something we care a lot about.

Very quickly, I'll cover what the CPU scheduler is and what it does. Classically you have a single core and multiple tasks — in Linux terms I may use the word thread interchangeably with task, but it means the same thing: some entity that I want to schedule. The scheduler's job is just sharing the core.
You have one physical core and multiple things that want to execute as if they were running on their own hardware, and the scheduler decides which task runs next, where it runs, all that sort of stuff. Obviously this simple model has changed a lot in the last twenty-plus years. We have multiple cores, but that's not much more trouble: you have multiple tasks, and you may have to move them from core to core depending on what's idle — again, the core functionality of the scheduler. There is some complexity in moving things around when you have caches: you don't always want to migrate a task from core to core, because some of its data may be hot in the cache, so you want to keep it on the same core; there are all sorts of penalties for migrating things. And that's just scratching the surface of the challenges involved in scheduling.

There are a lot of different problems you care about in a scheduler. One is fairness: how do I make sure each task gets the amount of CPU it should? There are various optimization properties: using the system's resources well, making sure the scheduler itself isn't taking too much CPU time away from other work, and working across different architectures and different workloads. And this is just a small subset — power management may be a concern of your scheduler, and you'd want it to behave differently; thermal management is another area I haven't listed. The scheduler is, in one way or another, involved in all of these things, and there are a lot of different constraints depending on the environment you run in.

CFS — the slides may be a little outdated in a sense — has been the default kernel scheduler, the Completely Fair Scheduler, for many years now. In the last year it was replaced by EEVDF, which works similarly; there are some differences, but I'll largely talk about CFS since it's been around much longer in Linux. It's the default scheduling class, and it's a fair, weighted, virtual-time scheduler — which is a wordy way to describe something that tries to preserve fairness. Largely what it does is figure out, for each task on a core, how much CPU time it has received, and scale that proportionally by the task's weight. If you set niceness on a thread, that changes its weight, so a task with a higher weight gets more time on the core. CFS keeps track of how much time it has given everything and makes sure each task gets its proportional share. The slide shows three examples, all with equal weights: with one task, it gets 100% of the CPU; with two tasks, 50% each; and so on. That's the abstraction — obviously at any one instant a core is running a single task, and CFS sits there switching threads frequently enough to give everyone their fair share. It's been around since 2007, so we're very close to the 17-year anniversary of CFS being in the Linux kernel. As I said, over the last year there's been a substantial change to move to EEVDF, but both schedulers have a lot of conceptual similarities: effectively a scheduler per core, some load balancing on top, and weighted virtual time as the mechanism.
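To make the weighted-virtual-time idea concrete, here is a tiny toy model (my own illustration, not CFS's actual algorithm or data structures): each task accumulates virtual runtime scaled inversely by its weight, and the task with the smallest virtual runtime runs next.

    # Toy weighted virtual-time scheduler: NOT CFS, just the core fairness idea.
    TIMESLICE_MS = 4

    tasks = {"heavy": {"weight": 2.0, "vruntime": 0.0, "cpu_ms": 0},
             "light": {"weight": 1.0, "vruntime": 0.0, "cpu_ms": 0}}

    for _ in range(1000):
        # Pick the task that has received the least weighted service so far.
        name = min(tasks, key=lambda t: tasks[t]["vruntime"])
        task = tasks[name]
        task["cpu_ms"] += TIMESLICE_MS
        # Heavier tasks accrue vruntime more slowly, so they get picked more often.
        task["vruntime"] += TIMESLICE_MS / task["weight"]

    for name, task in tasks.items():
        print(name, task["cpu_ms"], "ms of CPU")
    # Expected split is roughly 2:1 in favor of the heavier (higher-weight) task.

The real schedulers add timeslice accounting, per-core run queues, and load balancing on top, but the fairness mechanism is essentially this comparison of weighted virtual time.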
I think the history here is pretty important, because it dictates a lot about the structure of CFS and Linux scheduling in general. This is, I think, a 2007-era top-of-the-line server CPU you might have seen, with all of its two cores. What you notice is that there are just two cores and one L3 cache, and the topology is pretty homogeneous: you just have the two cores, there are no NUMA properties here, and the migration latency — the cost of communicating between two cores — was fairly high at the time. So a lot of what scheduling was developed around was keeping everything local: stuff should run on a single core and stay on that core, and periodically you can do load balancing to make sure your system is well utilized, but not a lot of thought beyond that.

Nowadays you have much more complex hardware topologies. I'll use what I think is AMD's terminology: core complexes and core complex dies. This kind of chiplet architecture has become common across a number of different hardware companies, which is to say you have a far more complicated topology than just a number of cores sharing a cache. You potentially have multiple caches on a chip, potentially multiple chips, and a NUMA system, which creates a lot more complexity. So the way I typically view it now is that you have heterogeneity across the system: different cores have different latency to other cores, and depending on which cores those are, some things may be faster or slower than in the past. On a die you may have fairly uniform memory access, while going across core complexes the cache may be faster or slower depending on the exact numbers. These are two fairly recent AMD microarchitectures — I know the picture is probably too small to see any details, but it's Zen 2 and Zen 3 if you want to dig into the details.

I highlight all of that to say: the hardware has changed, and that puts a lot of pressure on our schedulers to perform well in this shifting landscape. One place where we struggle a lot with CFS is that it's really difficult to experiment. I run a lot of production workloads at Meta and I want to see: what if I change this property of the scheduler, what's going to happen? It's hard — I have to recompile my kernel and reboot, and once the machine reboots it may take a while before the service is back up and running at its ideal performance. It's just a slow iteration cycle, and that makes experimentation with CFS in the kernel really difficult. It also has to be generalizable: the same scheduler needs to work for my Android phone, my laptop, and my data center, and that sacrifices performance in the name of generalizability. And of course the upstream process of working on the scheduler is challenging: not regressing it across all these different environments is an understandably high bar for contributions. What this results in is that a lot of different scheduler patches end up being maintained out of tree. At Meta we try to minimize the amount of out-of-tree scheduler patches we carry, but nonetheless, if we can save 1% performance on a workload, we're going to find a way to do that. So that ends up being a maintenance cost and something that's difficult for us to share with others.
I'm going to very quickly touch on BPF. This is not really a BPF talk, but if you have questions feel free to stop me at any point and let me know. BPF, at its core, is a kernel feature that allows you to run custom code that you've developed in the kernel, injected at runtime. In the very early days it was a generalizable way to do packet filtering — BPF stands for Berkeley Packet Filter — with custom logic that said, okay, if there's this field in the packet then do this. Over the last several years it has evolved into a much, much larger subsystem: you can use it for tracing and all sorts of things, by attaching to different functions in the kernel and running arbitrary logic. There were some demos earlier today showing a bunch of BPF. But at its core: you write C code, you compile it into BPF byte code, user space loads it into the kernel, and the kernel verifies that your code has a lot of properties the C compiler didn't necessarily catch — that you aren't accessing invalid memory, that you eventually terminate, a bunch of different properties — so that with this code running in the kernel, it's not going to crash the whole system. And there's a JIT in the kernel, so at a high level you're able to run arbitrary code inside the kernel.

Now I'm getting to what sched_ext actually is — I realize it's taken a little while in the talk to get to the punchline. sched_ext is really about allowing you to implement scheduling policies — the actual logic of your scheduler — as BPF programs that you can load at runtime. You can change them, reload them; you don't need to reboot your kernel. There's a whole lot of safety built into the system: if you load a bad scheduling policy and suddenly we see that, say, a thread is not getting scheduled or hasn't been able to run for a while, the core system behind it will kick out your scheduler and go back to CFS, or whatever the default scheduler is on the system. The iteration cycle is much like running any system service or user-space daemon: you write your own scheduler in BPF and load it with a daemon that injects the BPF code into the kernel, using a newer BPF feature called struct_ops. Now you have a new scheduler running and you can take over all of what the Linux kernel scheduler is doing.

The really big piece for us is that it allows rapid experimentation. As I mentioned, you don't need to do any reboot; you just recompile your program and go. This is a huge win for us trying to run across a bunch of machines — if I had to reboot them all... I don't even have to stop the running workload: I can just unload the scheduler, it falls back to CFS, I get a new one, I deploy it, and the workload keeps running just as before. The API is quite nice — I know that's fairly subjective, but it's a lot simpler than the core Linux APIs around scheduling, and there's a lot of heavy lifting already done for you in the APIs. And as I mentioned, you have a lot of safety here.
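As a rough sketch of what that user-space daemon side could look like: open and load the compiled BPF object, then attach its struct_ops map, which is what activates the scheduler. The scx_simple__* names below are skeleton functions that bpftool would generate for a hypothetical object called scx_simple, so treat them as placeholders; bpf_map__attach_struct_ops() is the libbpf call that performs the struct_ops attachment. This is an illustrative sketch, not the actual loader from the talk.

    #include <unistd.h>
    #include <bpf/libbpf.h>
    #include "scx_simple.skel.h"   /* hypothetical generated skeleton header */

    int main(void)
    {
            struct scx_simple *skel;
            struct bpf_link *link;

            /* Load the BPF object into the kernel; the verifier runs here. */
            skel = scx_simple__open_and_load();
            if (!skel)
                    return 1;

            /* Attaching the struct_ops map is what makes the scheduler live. */
            link = bpf_map__attach_struct_ops(skel->maps.simple_ops);
            if (!link)
                    return 1;

            /* The scheduler stays active while this daemon keeps running. */
            pause();
            return 0;
    }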
So BPF, on the one hand, makes sure you can't write a scheduler that, say, corrupts memory, which is nice. There's a watchdog that will disable your scheduler if it isn't running some tasks for a configurable amount of time, and there's also just a key you can hit if you need to kick your scheduler out when something isn't running well. Another big win is that it moves a lot of the complexity into user space. Load balancing is classically pretty complicated in the kernel — trying to figure out how much load is on each core, how much load should I take away — and you have to work around the fact that using floating point is often not something you have the ability to do in the kernel. If you move all that logic into user space, you can interact between user space and BPF to do load balancing a lot more flexibly, using standard debugging tools. I actually think this is a space where the kernel has gotten a lot better recently — there's cool stuff like drgn and new ways to debug kernel code that you couldn't use in the past — but it's really nice just to use gdb right on a binary and debug how it works. So it becomes a lot easier to figure out: okay, I need to add some more data to my scheduler, I want to know when it's doing this kind of event, have it bump a counter; from user space I can see that, read it, print it out. Those sorts of things come a lot easier. And then a big thing is simply the ability to share scheduling logic. As I said, if we have our own custom patch for our scheduler and I want to share it — well, it only works with this version of our kernel; it can be a little tricky, and things change. The core functionality we get out of sched_ext is that you have an API: it tells you what you need to implement to be a scheduler, and you can share that pretty easily; people can load it to experiment on their own machines pretty easily.

All right, next section: I'm going to get into how to actually build schedulers with sched_ext. I debated a little bit about whether I should put code on slides, and decided some of the code is short enough that it is worth putting on the slides to show exactly how short it is. I made that decision before I saw how large the screen was, so my apologies to people, particularly in the back, who may not be able to see things — but you will get a sense of how large the code is, I hope.

So, very concretely, what do you do? You create a BPF program. It fills out a struct with function pointers; those function pointers are callbacks. The kernel is going to call those functions at certain points during the process of scheduling, and you tell the kernel what to do. Those callbacks are things like task wake-up: a task was blocked reading on a socket, say, and suddenly that socket has data; it is now awoken — what do you want to do with it, which core should handle this wake-up? Enqueue and dequeue: when a task is runnable, what do you want to do with it? And you get notified of all sorts of state changes of tasks; usually these are helpful for accounting you might want to do — know when something is runnable, know when something's actually running, and keep all sorts of your own statistics inside your scheduler. There's cgroup integration built in; there are a bunch of callbacks.
I won't cover everything here, but it's a fairly complete set of what you can do. There are also a bunch of fields you fill out in the struct that are not function pointers, but define, say, the timeout threshold at which point your scheduler will be evicted, what the name of the scheduler is, stuff like that. How visible is this, just out of curiosity — can people relatively far back see anything? Okay, cool.

All right, this is more code than I wanted to show, this is more of an eye chart, but I just wanted to show exactly what it looks like. There are three function pointers at the top. One is select_cpu: a task has woken up; figure out what to do with it. Enqueue: this task is ready to be run; what do you want to do with it? Dispatch is a pretty common one: the CPU has nothing to run; what do you want me to do? All of these have default implementations, so you don't necessarily need to implement all of them, but any sufficiently complicated scheduler will probably implement these three.

Dispatch queues — if there's one thing you want to know about how to actually use sched_ext, you need to understand dispatch queues; they're the core data structure that everything works through. Rather than you telling the kernel "run this thread, pull it off this queue, move this around," there's a core data structure offered for you called the dispatch queue, and this is really the building block for all these policies. Every core has its own dispatch queue that is implicit in the system and has a special name, SCX_DSQ_LOCAL. Whenever you put a task on that dispatch queue, then when that core is about to look for something to execute it'll check that local queue — it's all done for you — pull something off of it and execute it. Otherwise, you can create arbitrary numbers of dispatch queues. You can put stuff on them from any core; the core logic of sched_ext will deal with locking and all that sort of stuff for you. You can do things like per-NUMA-node dispatch queues, or a global dispatch queue, or a per-cgroup dispatch queue — however you want to structure your logic — and you then migrate stuff between those dispatch queues and the local dispatch queues to actually get tasks to execute.

So I'll give an example. Here I'm just showing local dispatch queues — this is ultimately what the kernel is looking at, so anything you put on a local dispatch queue will get executed. All right, so the example: I want to write a scheduler where, any time there's a thread to run, I put it in a global dispatch queue that all the cores contribute to, and whenever a core is idle I pull from that global dispatch queue and just execute. You might think, hey, that's a pretty dumb scheduler — I have no cache locality, it's just throwing everything into one queue. Spoiler alert: this actually outperforms CFS for some of our production workloads on some machines. Just the ability to get a runnable task onto an idle core as quickly as possible has some benefits. Obviously you can find large enough machines where the lack of locality hurts you, but this, while a very simple scheduler, is actually production viable.
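Before getting to the example's code, here is a hedged sketch of roughly what that ops table from the "eye chart" slide looks like once filled out. The field and macro names follow the sched_ext patch series and its example schedulers as of the time of this talk and may differ in later revisions, so treat this as illustrative rather than exact; the three callback bodies referenced here are sketched a little further down.

    /* Illustrative only: names approximate the sched_ext patch series,
     * assuming the helper macros from its example scheduler headers. */
    SEC(".struct_ops")
    struct sched_ext_ops simple_ops = {
            .select_cpu = (void *)simple_select_cpu, /* a task woke up: which CPU should take it?   */
            .enqueue    = (void *)simple_enqueue,    /* a task became runnable: where do we queue it? */
            .dispatch   = (void *)simple_dispatch,   /* a CPU ran out of work: what should it run?   */
            .init       = (void *)simple_init,       /* one-time setup when the scheduler is loaded  */
            .timeout_ms = 5000,                      /* watchdog: evict the scheduler if tasks stall */
            .name       = "simple",
    };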
So basically what you need to do is: from each core, when there is a task that's runnable, dispatch it to a global dispatch queue. By default it'll just sit there — nothing will ever execute it — and you then have to later dispatch it to a local dispatch queue to get the kernel to actually execute it.

So here's where I'm showing code. This is really all you need to do for the enqueuing side. I'm showing two functions I've implemented, init and enqueue; these are two callbacks you fill in for sched_ext. In init, all I'm doing is creating a dispatch queue I call QID zero. In enqueue, I'm getting a callback that this task — struct task_struct *p — is now ready to run; what do you want to do with it? I say dispatch it to QID zero, this global dispatch queue I've created; there are some flags, and I set a slice length so it knows how long the task is supposed to run. That's really all the code required for enqueuing onto a global dispatch queue.

All right, the other side is: once a task is on that global dispatch queue, how do I actually get the CPUs to execute it when they are about to go idle? The cores consume tasks: this function, called consume, takes a task from a dispatch queue and puts it on the local dispatch queue to actually execute it — it enqueues it onto the local dispatch queue. As I said, the code for doing that is equally small. In dispatch — the callback that is called when a CPU is about to go idle — you call scx_bpf_consume, give it the same QID, and that puts a task on the local queue and allows it to be executed. I've glossed over a couple of details; for example, you may have to worry about a core going to sleep when it has nothing to do — if it can't consume anything, you may need to wake it up to execute something — but this code is pretty much all you need for a global dispatch queue.

As I said, this actually works pretty well on single-socket, single-L3-cache machines that we've tested it on, and we actually have a scheduler I've linked to called scx_simple. It's literally a full scheduler in a hundred and fifty-five lines of code — defining the struct, everything around it, comments, all that sort of stuff. It's super simple but a surprisingly effective scheduler on its own.
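For readers of the transcript, here is a rough reconstruction of the enqueue/dispatch pair just described, in the spirit of scx_simple. The helper names (scx_bpf_create_dsq, scx_bpf_dispatch, scx_bpf_consume) and constants are taken from the sched_ext patch series at the time of the talk and have been renamed in later revisions, so again treat this as a sketch rather than the exact slide code.

    /* Sketch of a global-dispatch-queue scheduler; assumes the helper macros
     * and kfunc declarations from the sched_ext example scheduler headers. */
    #define SHARED_DSQ 0                     /* the "QID zero" queue all cores share */

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            /* Create the shared DSQ once at load time (-1 = any NUMA node). */
            return scx_bpf_create_dsq(SHARED_DSQ, -1);
    }

    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
    {
            /* Every runnable task goes onto the shared queue with a default slice. */
            scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
    }

    void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
    {
            /* A CPU is about to go idle: move the next task from the shared
             * queue onto this CPU's local DSQ so the kernel can run it. */
            scx_bpf_consume(SHARED_DSQ);
    }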
I'll give a slightly more complicated example — I don't think I added code for this because it is a little more involved. Maybe I want to do something different: I have multiple L3 caches in my system, and I find that this global queue gets pretty poor locality and drops performance quite a bit compared to, say, CFS. So one idea we had was: I'll have one queue per L3, and I may have to do load balancing across them. It's pretty much the same idea, but in this case I have two sets of cores, each running on its own core complex with its own L3. I have each of those dispatch to its own dispatch queue, and then a core, when it needs to consume the next task, just looks up what its dispatch queue should be and consumes from that one. So you can do these kinds of arbitrary groupings. In this particular case I have basically partitioned the machine into two sets of schedulers.

Obviously, as is, this would not be work conserving: I may have overloaded one set of cores with a bunch of tasks, so its dispatch queue gets super full while the other one is idle. So you have to do all sorts of interesting logic here — okay, I maybe want to steal from the other dispatch queue if there's nothing else going on here, because I have spare compute to execute; or maybe there are only special tasks you want to steal. There are all sorts of different ways to do it; we've got a couple of schedulers that do stuff like this, and there are a lot of knobs you can play around with to see what works well for your workloads. The whole idea for us is that sched_ext lets us experiment with these different policies really quickly: it's just a new parameter we pass in, and we get to see how it works.

I'm going to talk a little bit about example schedulers we already have. We have some Meta-developed schedulers — ones our team at Meta has developed — and people externally have also developed a number of schedulers; I'll talk about those as well. I'll be very brief about what each of these schedulers is. Rusty is actually fairly similar to what I just talked about — a kind of per-CCX scheduler — but it does load balancing on top, with a lot of different policies on top; it was also a demonstration of our ability to write Rust code in user space, where all the load balancing actually happens. Simple I already touched on: it's just that global dispatch queue with a little bit more in there, but not much. flatcg is a scheduler that tries to optimize cgroup fairness. One thing we've observed a lot in production is that the CPU controller adds quite a bit of overhead trying to ensure fairness; flatcg fudges the numbers a little on exactly how much we provide each cgroup, but ends up being a lot more efficient in its scheduling. And layered is a relatively new scheduler we've worked on that lets you really handcraft policies: I want threads named foo to run on these cores, keep those cores at 80% utilization, and run everything else over here. You can do all sorts of policies to affinitize stuff to different cores without it needing to be super strict.

scx_rustland is by Andrea, who works at Canonical — a pretty cool scheduler he worked on, and I'll show a little demo from him. The whole idea is that it prioritizes interactive workloads over more CPU-intensive workloads, and it's very aggressive about doing that. Let me see if I can — there's a video I want to show; we'll hope this works. It should hopefully be self-explanatory and hopefully visible. The whole idea of this scheduler is: if you're running two different things on a machine — one you really want to run well when someone's interacting with it, and one you care a lot less about, like, say, compiling the kernel — can you prioritize the more interactive workload? He shows some pretty cool results here. So here he starts a kernel build.
He's now running a game, and this is with CFS: it is very choppy, a pretty low frame rate; he finds it's hard to enjoy. He then, just from user space — sudo scx_rustland — runs his scheduler. It's now running, the compilation is still going, and it's a lot smoother, 60 frames per second, just by changing the scheduler, and you can do that fully at runtime. So this is cool; a lot of people into games have been interested in sched_ext just because you can optimize frame rates and such in ways you couldn't before. This was a cool demo from him.

All right, I'll wrap up pretty soon and talk about current status and our next plans. Most important to us: at Meta, when we work on Linux, we have an upstream-first philosophy. Generally speaking, whenever we find bugs in the kernel, we first try to merge the bug fixes upstream and only then use them internally, and the same for new features and that sort of stuff. The rationale is that it allows us to follow the upstream kernel pretty closely: every time we want to update our kernel, there isn't a big difference between what we're running internally and what's running upstream, because any changes we've made went upstream first. It also has a lot of value to get upstream feedback on all of our work. sched_ext is not yet upstream, but this is our top priority; we're still iterating a lot with members of the upstream community to get feedback. Part of the reason we're here is to try to convince people that sched_ext is useful, that you can write schedulers with it, and we want to build a community around it. I've linked to the latest V5 patch set, and there are a bunch of new features in there, adding BPF features and the like. A team from Google made a public commitment to building their own extensible scheduling framework on top of sched_ext, and we've been working with them on some of that, so we're pretty excited about expanding the use cases here.

There are a lot of new features we could build, a lot of them in the BPF space — we want to make it easier and easier to implement logic. We've done a lot of stuff: you can now add spin locks in BPF, you can refcount data structures, and it will make sure your code is still safe — it will fail to load if you, for example, fail to unlock a data structure or a spin lock. There are a lot of new schedulers we're looking at: power-aware stuff; various latency-nice properties, so giving more latency-sensitive workloads preference; soft affinity, trying to isolate cores — all sorts of things we've been looking at, trying to expand the number of schedulers we have and find what works best for our workloads. And a whole slew of the BPF features we've talked about.

There's a bunch of links — I'm not sure any of this is visible, but the main repo is on GitHub; that's where we maintain the Linux kernel patches. The third link is where we keep all the schedulers we've been maintaining; people can contribute to them, and we're managing this repository of a bunch of different schedulers that people can check out and play around with. There's a link to documentation there as well. That's it for my talk, so I'm happy to take questions now. I don't see a microphone, so if you just want to shout, I can — oh. I'm curious: when the BPF virtual machine loads the new scheduler policy, a scheduler is clearly already running, so how does that switchover happen?
Yeah, so there are actually multiple scheduling classes in Linux at any one time — you have, for example, the FIFO scheduler — and they're layered in some way: okay, if the FIFO scheduler has something to run, run that before CFS. In that hierarchy sched_ext sits there too, and if something is loaded it will run stuff. But you fundamentally need to change each task and say: this task is no longer running on CFS, it's now running on sched_ext. There's a helper function in sched_ext to do that when we load the program, so it's very common that you load the scheduler and one of the first things the scheduler does is switch all the tasks to start running with sched_ext. That changes them from running in CFS to running in sched_ext, all those tasks now hit the sched_ext callbacks, and it starts running; as soon as it exits, they move back to CFS. So technically both schedulers are running at the same time; it's just that it usually doesn't make a ton of sense to keep a bunch of tasks on each scheduler, because neither scheduler is usually aware of what the other is doing and you don't want them competing. It does not freeze everything as it's going — it's not a super cheap operation, but we do this while production workloads are running and it's not visible.
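A hedged sketch of that "switch everything over at load time" pattern: earlier sched_ext patch sets exposed a helper for it — assumed here to be scx_bpf_switch_all(), as used by the example schedulers of that era; later revisions changed how tasks are opted in — which the scheduler's init callback can call so that all SCHED_NORMAL tasks move from CFS onto the sched_ext class for the lifetime of the loaded scheduler.

    /* Illustrative only; helper name and mechanism vary across patch versions. */
    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            scx_bpf_switch_all();                      /* opt every normal-class task into sched_ext */
            return scx_bpf_create_dsq(SHARED_DSQ, -1); /* then do the usual setup */
    }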
Do you have benchmarks on the effect on other processes — especially for kernel functions, for interrupt handling, for high-throughput networking, high-throughput disk IO? Is there any impact to having some stuff in this scheduler and other stuff elsewhere? And my second question was: have you tried running kworkers under sched_ext?

Yeah, so to the first question: ultimately what we really care about is application performance, which is very complicated — we get one number, like how many requests were served by my web server, and if that goes up, great, and if it goes down, terrible. We've played around with a bunch of different schedulers; as I said, scx_simple was a 1%-ish performance improvement over CFS on some of our production hardware, not all of it. We've got other schedulers — configurations of scx_layered, for example — where we find similar performance wins across more servers. We haven't looked specifically at high-throughput networking or anything like that. One thing we've been finding a lot is that isolating disparate workloads across disparate cores is really valuable: put my web server threads here, and the background threads doing a lot of work over there — don't even try to share the same cores, actually isolate them to separate groups of cores. That ends up having nice locality properties and wins. The second question was kworkers. Certainly most of the time when we're experimenting we move all the tasks over to sched_ext, so they would be running under it. A lot of kworkers are pinned to a core, so they're not super interesting from a scheduling perspective — it's just "run it on that core" — but we haven't explicitly experimented with moving kworkers around in any particular way.

So you mentioned that right now you basically immediately move all tasks over to sched_ext. Do you have any sort of intelligence where you ignore real-time threads and don't move them? Because moving them would probably not be what you want. Yeah — the helper function will only move default-class tasks over to the sched_ext class, so real-time is left untouched. We are not often running real-time threads in production. It's just hard for any scheduler to reason about multiple schedulers running at the same time unless you're partitioning cores in some way, so at least in our experiments it has not been something we've played around with. But if you do have real-time threads, they will remain real-time and keep running.

Do you think this could potentially be used for testing programs under different execution orders? Yeah, it's something we've looked at a little bit. To elaborate on the idea — tell me if I'm misrepresenting it — let's say I have a multi-threaded application, and under some interleaving of threads it fails; how would I uncover that? With a regular scheduler you just run it a ton of times and hope that eventually you get the bad interleaving, but what if you actually have a scheduler capable of forcing it? I do think it's viable. A lot of the techniques I'm familiar with that explore this do it basically in user space — they interleave the threads themselves — and I'm not sure there are huge advantages to doing it in sched_ext versus user space, but it would definitely be a way to explore the different interleavings. You probably want application-level information anyway, to know "okay, now I'm hitting a critical section, let me make a scheduling decision." But it's definitely an idea that's been thought through a little bit. Cool, thanks so much.

Hello — hey, I think we'll give it a couple more minutes. Okay. Hey, welcome, and thanks for staying until 5 p.m. So, I am Tejun. I work at Meta, and this presentation is about IOCost and something called resource control bench. Let's get to it. So, who here knows what a cgroup is? Okay, that's a lot of people. Who here knows what a cgroup controller is? Cool. So, really simply — I cannot get comfortable with this mic — a cgroup is just like a directory. It's a tree structure you create, and you put processes in it, so it's like a directory of the processes in your system. And because it looks like a file system — a tree — it is a hierarchy: something is the parent of something else, so you can organize the processes in your system that way. If you use systemd on your system, that's what systemd uses by default — really, the only way systemd works is that it creates its logical structures using cgroups: this service is composed of these processes, and these services are under this slice, which is an intermediate node that systemd creates.
So basically, if you look at the tree structure on the upper right: the root is the system root, and below it there are things like system.slice, workload.slice, hostcritical.slice — whatever you want to create. So you can broadly categorize the processes you have on the system; below that you can have services, and you can have another slice to further categorize them. Basically it gives you that organization.

A cgroup controller, then: once you have this tree structure, you can do something more interesting with it. You can say things like, for this sub-hierarchy of the tree, give it twice as much CPU as that part of the tree; or for this guy, give it twice as many bytes per second as that guy. So you have the opportunity to distribute your resources according to the hierarchical, logical structure of your system.

"IO" here can mean a lot of different things, but we are just talking about block IOs: what you read and write from disks or SSDs. So yes, IOCost is a cgroup IO controller: it attaches to the cgroup hierarchy and distributes SSD IOs — mostly; some people still use hard disks, but mostly SSD IOs — to different parts of the system.

So why is that challenging? Why do we need something different? One of the biggest challenges is that the units, or the metrics, that we can observe trivially are not great. If you think about IO devices, what do you count if you want to describe their behavior? Throughput — right, and IOPS; those are the two big ones: the number of IOs and how many bytes you are moving. Those are the big metrics, and they're not great. I can give you a really simple example. Let's say you have a mediocre SSD you bought from Amazon for $20, and let's say your IOPS budget is, say, two thousand — that's low, but just for the example; for random IO it's two thousand. You think: okay, I'll give this guy, which is kind of low priority, a hundred, and everybody else can use the rest — then I'm set, right? You think that, and then that low-priority guy starts streaming multi-megabyte sequential IOs and completely saturates the device. So the problem is that the same number, whether in bytes or in number of IOs, can be simultaneously too low and too high depending on the workload. You don't have a good configuration number — the right range just doesn't exist. That's one big problem, and it makes single-metric-based control really challenging, because these numbers are just not representative of what's actually happening on the device or how close it is to saturation.
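To put illustrative numbers on that mismatch (the sizes here are my own, not from the talk): suppose a low-priority cgroup is capped at 100 IOs per second. Spent on 4 KiB random reads, that is roughly

    100 IO/s × 4 KiB  ≈ 0.4 MB/s

whereas the same 100-IO budget spent on 8 MiB sequential writes is roughly

    100 IO/s × 8 MiB  ≈ 800 MB/s

— a three-orders-of-magnitude difference in load on the device under the exact same IOPS cap, which is why a single IOPS or bytes limit can be simultaneously too low and too high.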
The second problem is that SSDs are really erratic. I think they're getting better nowadays, but still. What you see in the fleet is — we have a lot of machines, and we buy a lot of really cheap SSDs for some reason that I cannot understand — that things seem fine, but once every two weeks, with this workload, this SSD stalls for 40 seconds. And if an SSD stalls for 40 seconds, nobody is happy: your workload is just stalled, and it's not good. Those things do happen. That's an extreme example, but if you think about SSDs, these are not simple devices anymore: they have their own memory, they have caches; even within their own flash they have SLC and multi-level-cell sections, and they do garbage collection. So the performance characteristics you get are not very consistent in a lot of cases, because they optimize so much — and they try to optimize for common benchmarks.

Another problem is that they are really fast nowadays. Even the thing you bought from Amazon for 20 or 40 bucks is just really fast; many of them can do hundreds of thousands of IOPS, and that poses a challenge: if whatever is trying to control your resource is a little bit expensive, and that happens hundreds of thousands of times per second, it becomes a problem. So it becomes a bit similar to the networking control problem — the reason that's challenging is that the frequency is so high.

The last thing is that IO is really closely intertwined with the rest of the system. For example, certain IOs are more important than others. If you're doing, say, a file system metadata update, you end up writing to the journal, and in a lot of file systems the journal is a serial stream of data — different users or different applications don't get their own journals. Everybody shares one journal, and if one IO stalls in the journal, all the other writes to the file system stall. So it becomes a problem: this guy was writing to the file system, it had to write to the journal, but it was low priority, so I throttled that IO, I slowed it down, and the whole system is slow. That's another problem. This happens a lot for file system metadata, but it also happens for memory reclaim, because memory is a resource that everybody shares: if you slow down the memory reclaim of a low-priority process, a high-priority process might be waiting for that memory to be reclaimed. So there's a natural priority-inversion possibility there.

So how does IOCost solve these challenges? The first one was the single metric — we don't have a good number. The way IOCost solves that is that it has something called a cost model. It doesn't just look at a single metric like BPS or bytes; it combines — right now, linearly combines — the different aspects of the IO to assign a cost to a given IO. In practice, it uses several criteria: if an IO is a random IO compared to its previous IO, the base cost is higher.
Then there's a cost associated with the size of the IO, which scales linearly with how many bytes you transfer; and then, depending on whether it is a read or a write, there's a different overhead. You combine these different features into a linear equation with a set of parameters, and they combine to give you a single number. Right now it's a fairly simple linear model, which is already a lot better than a single metric, and the model part can be further extended if necessary. So that's the cost part: how you measure how expensive a given IO is.

The second part is configuration. If you configure IO distribution according to, say, IOPS or BPS — hard metrics — the problem is that it's difficult to know the right values. If you walk up to an application developer and ask "how many megabytes per second do you need?", you usually don't get a good answer; nobody really knows. That makes configuration really challenging. So IOCost is a proportional, work-conserving controller. What you say is: this guy gets x times more than that guy — that's what you say at each hierarchy level — but you don't give any specific number, and it is work conserving, so you don't lose total work; it only kicks in when there's contention for the resource. If the IO is not contended, the SSD is always going to be performing work if there's work to be done. So it's fairly easy to configure: you're just saying that this guy is generally more important than that guy, rather than using a hard number.

On the performance side, IOCost is implemented as two separate, interacting pieces. One is called the slow path, or planning path, which runs every — depending on the specifics of the device and the model — five or ten milliseconds, something on the millisecond scale. It does all the complex things: who's active, who should donate remaining budget to whom, all those calculations, and then it plugs those numbers into the issue path, the execution path. The execution path is a lot simpler: it just makes throttling decisions locally, without the CPUs really talking to each other, which makes it really cheap in the hot issue path. That's how IOCost, while implementing a fairly high-level, abstracted control mechanism, keeps the execution simple: the high-level abstraction lives in the planning path, which only runs every 5-10 milliseconds, while the IO path, which can run hundreds of thousands of times a second, stays really simple. That's how it achieves high performance.

It also integrates with the file system and memory management paths to understand which IOs are important and which IOs cannot be throttled. The way it handles those is that if an IO is expected to cause a priority inversion when throttled, it just issues it right away and then charges it back as debt to the issuing cgroup — it makes them pay later, do first and pay later — so that it doesn't slow down the whole system.
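Circling back to the cost model described a moment ago, here is a toy sketch of the idea in plain C. The structure, field names, and the notion of separate read/write parameter sets are mine for illustration and are not the actual iocost implementation or its real coefficients.

    #include <stdint.h>
    #include <stdbool.h>

    /* One set of linear-model coefficients per direction (read vs. write). */
    struct iocost_coefs {
            uint64_t seq_base;   /* base cost of a sequential IO          */
            uint64_t rand_base;  /* higher base cost of a random IO       */
            uint64_t per_page;   /* cost added per 4 KiB page transferred */
    };

    /* Cost of one IO: base cost chosen by randomness, plus a size term. */
    static uint64_t io_cost(const struct iocost_coefs coefs[2],
                            bool is_write, bool is_random, uint64_t pages)
    {
            const struct iocost_coefs *c = &coefs[is_write];
            uint64_t cost = is_random ? c->rand_base : c->seq_base;

            return cost + c->per_page * pages;
    }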
This is a cool diagram somebody else drew for a paper; it just shows in a diagram what I explained. The offline device models — we'll talk about those later, but that's how you derive the parameters to calculate the cost — we take those numbers and configure it, and then there's the issue path and the planning path: the planning path runs slower and does the complex math, then plugs in the numbers, and the issue path issues.

Okay, so how does it work in practice? This is a benchmark on one of our more expensive SSDs, which can do 750K IOPS — not super high, because higher-end ones can easily do millions these days — but this compares simple 4K random reads: how many thousands of IOPS can we do with different schemes employed? If you do nothing, you can do 750K, 100%. With different things enabled — for example, bfq here performs really poorly, which is expected, because bfq is primarily designed for hard disks; it doesn't optimize the issue path. The assumption older IO controllers and IO schedulers used to make was that you can pay CPU time to optimize IO, because IO is so slow and CPU is so fast — order the IOs better by spending CPU and you get a better result. That's the assumption bfq was making, and it's not true anymore, simply because SSDs are so much faster. But you can see that with IOCost you can maintain 750K IOPS.

Does the control actually work? This graph needs a little bit of explaining, but basically you have a workload which is latency sensitive, in the sense that if the IO takes too long, the workload ramps down. A web server, for example, would be like this: in a load-balanced environment, if you're slow to respond you get less work; you have a certain SLA you have to meet to keep getting work. So it's a workload like that: as long as latency stays below a certain threshold, it keeps issuing more IOs. We have two of those, and we're trying to assign a two-to-one ratio — this guy is twice as important as that guy. blk-throttle does this perfectly, and the reason it does perfectly is that I configured it with IOPS; it's an IOPS benchmark and the IOPS limits were two-to-one, so the result is two-to-one. bfq struggles a little bit; io.latency struggles a little bit too. IOCost is okay: it does the two-to-one distribution, and it would be a lot more robust than blk-throttle — even if the workload changes, it will maintain that distribution. I didn't put that graph here, but another experiment we did was to run the same experiment with one of the workloads changed to a sequential workload, and in the ideal case you would expect the other workload to maintain a similar level of IOPS.
And that actually happens. So IOCost works, it's great, and it's fully deployed at Meta on almost all of our machines. This next part — I forget the exact details, but we have a lot of services which maintain these machines, and when IO is not isolated, not distributed fairly, certain things fail on the system. For example, here is package fetching: you're downloading the package for the next version of a service to run, and if the disk is too contended, the write becomes too slow, it times out, and the thing fails. If you set up the IO controller correctly, then no matter how busy the disk is, the service writing the package gets sufficient bandwidth and it succeeds. We deployed IOCost across the fleet over two months, and you can see the failure rate drop over time as the isolation starts working in the fleet. Another similar thing — basically the same story — is a different part of the service, the container cleanup path: the same thing; as you deploy, you see the error rate drop.

So IOCost works well, as long as it's configured well. And the problem is — I'm not sure this is legible, but no, that's the intention — you have something like twelve almost random-looking numbers to configure. There are a lot of numbers; how are you going to configure them? In the kernel tree there's a tool, the cgroup iocost coefficient generation script (iocost_coef_gen.py), which is a really simple benchmark: it basically measures your maximum sequential read and write bandwidth, maximum 4K random IOPS, and so on, and then it outputs the parameters. But I did say that SSDs' performance characteristics are really not easy to predict, and they optimize for exactly these kinds of tests: they perform really well on this test, and then when you give them an actual workload, they just... well, they lie. So how are you going to go about that?

Resource control bench — resctl-bench — is an attempt at answering that question. The problem is that the performance characteristic is really multifaceted and complex; it's not easy to characterize. So one approach we can take — this approach — is to come up with a scenario which is plausible and similar to what we see in the production environment, and then see whether IO isolation works well enough for that particular scenario. In that sense it's fairly narrowly scoped, but at the same time the bar is fairly low — we don't really have a hard bar here, because nothing else really works well — and our experience has been that something which works well in this kind of testing tends to work well for a lot of other stuff.

The scenario it uses is derived from our production experience: we run a web server, which is obviously latency sensitive — if latency goes above a certain level, the load balancer just kicks it out — and then we run an aggressor workload.
The aggressor is usually a memory hog which keeps expanding its memory usage. That's a lot of reads and writes, and it creates a lot of pressure on the system: it generates a lot of swap and page cache IO, and at the same time the memory pressure it creates makes the web server itself shrink, which generates more IO. So everything becomes IO dependent. In that environment, can the web server survive? What the benchmark does is recreate a similar scenario using a pseudo-workload and an aggressor, then test one throttling level and see whether isolation works well; then it throttles the device down a little more and repeats. It keeps collecting data points across the throttling levels and tries to find where the device actually can isolate.

That's what it does, and then it produces this kind of graph — it produces a lot of graphs, but here are some of them. The blue dots are the data points. The x-axis is how much the device is being throttled: 100% means I'm not throttling the device at all; 10% means I'm only letting through — if the device maximum is 100 megabytes per second — 10 megabytes per second. The y-axes are different things, but the lower-left graph is probably the most interesting: it shows how much isolated bandwidth you can achieve at different throttling levels. The cool thing is that you see that angled shape: there's a peak and then it goes down. What it says is that when you don't throttle the device at all, the amount of successfully isolated bandwidth you get is actually lower, because you cannot isolate well. As you throttle the device, the amount of isolatable, protectable bandwidth goes up, and there's a peak where you hit a balance. If you throttle further, you're throttling too much: now you're capping the maximum and pushing the total down. So for this workload you would obviously pick that peak; that's the best isolation point, the best throttling point.

Excuse me. So it runs a bunch of tests, plots all of this in graphs — you don't need to read this, I just wanted to show it — and it provides the solutions. The important part is the solutions section: there are multiple subsections, and it says, on this criterion these are good parameters, on that criterion those are good parameters. You just pick which one you want and copy and paste the parameters.

This sounds like a lot of work, especially because the benchmark runs for six to eight hours. And the reason it runs that long is because the devices lie: you cannot trust them after testing them for ten minutes.
You really have to burn them in. But the good news is that the models are compatible — translatable — across different instances of a device, for the most part. So you only need to run the benchmark a few times per model: if you have the parameters for one model of SSD, you can apply them to all the other SSDs of the same model. The best way of doing that would be everybody — well, not everybody, but some people — running it on their devices: the more people run it and collect the parameters, the more we can build a common database. Then most people can just download that database, pick the parameters they want, and use them. So that's what we — sorry, I've been coughing for three weeks now; having a toddler sucks — that's what we try to do here.

We try to do two things. One: resctl-bench is painful to run, because it runs really long and because it needs a lot of system-level setup — you need a btrfs root file system, swap of a certain size, and all that. So we created an installable image that you can put on a USB stick. You plug it into a machine, install the test benchmark system image onto the target device — I'll show you later — and then you can just run the benchmark from there. You don't have to worry about how to set up the benchmark; it comes canned. And then, once you get the benchmark result that way, you can upload it to a GitHub repo and it will automatically update the database from it, so everybody can collect the data. We have built the system but haven't really collected many benchmarks yet.

Yeah, okay, so — I chickened out of a live demo because it takes eight hours; I mean, if you guys want to stay... Anyway, I just screen-captured it; I ran it a couple of days ago. When you boot this image, it puts you into this screen; you press enter and you can select your target device. This is my test machine with a bunch of SSDs, so you select which SSD you want to test, and then it warns you that it's going to destroy everything on that SSD, and then it installs — not very pretty, but it does. Then it says "do not shut down": it means that it doesn't reboot, it just moves into the installed test image, and then you can run the benchmark right away. In the banner it says: type this shell command to run the benchmark, and you just type that. Then it tells you, this is the benchmark
I'm going to run, and it does things, and it does things over and over and over, and six hours later it spits out the result like this, and it stores the result onto your USB stick if you left it in. The next step is that you take that USB stick and plug it into a working machine, and in the results directory there's a PDF — I opened it; it has all the result graphs — and there's also a text report, which actually contains a lot more data. If you look into that, it shows really detailed latency behavior of the device across the testing, so it's actually a pretty good way of evaluating how bad an SSD is. And then there's a JSON file — the one ending in .json.gz; that's the actual data file containing all the results. You open the GitHub repo — it's called iocost-benchmark — go to issues, press "new issue", and it says "benchmark submission, get started"; you drag and drop that JSON file into the issue and submit, and then it says something, and then what happens in the background is this.

One other thing about the benchmark — excuse me. I said the benchmark takes six to eight hours, and that collects maybe 20 data points. Graph fitting with 20 data points on consistent devices is fine; on erratic devices it's not good — you need more data points. So if multiple people, or the same person, run the benchmark multiple times and submit the results, then on the back end it combines all the data points into a single data set and fits the graph on that, which is a lot more accurate. So when you submit a benchmark, what happens in the background, in the CI pipeline, is that it merges them, does all the calculations, and eventually compiles that into a database file which can be fed to systemd — there's a systemd feature you can feed this into — and then in the systemd configuration you can say, for this SSD use this type of model parameters, one of the four or five parameter sets you can pick. Then everything becomes fairly easy: as long as there are people running the benchmark and submitting the results, we have a database file covering enough devices, and all users have to do is pick which model and which parameters they want.

So that's it — IOCost and, excuse me, resctl-bench. These are the links. IOCost has a paper, which explains how it actually works in all the details. This is resctl-demo, resctl-bench is at this link, and there's a results repository. If you go to the IOCost results repository, there's a link to how to install the image and all that, so it should be fairly straightforward, although the documentation can definitely be improved. That's the presentation — any questions?

So, you talked about variability in SSD performance, or lack thereof. Do you have criteria you can use to consider an SSD so bad that you don't use it?
So that's it: that's iocost, and, excuse me, this is resctl-bench. And these are the links. iocost has a paper, which explains how it actually works in all the details. This is resctl-demo, and resctl-bench is at this link, and there's a results repository. If you go to the iocost results repository, there's a link to how to install the image and all that, so it should be fairly straightforward, although the documentation can definitely be improved. That's the presentation. Any questions?

So we talked about variability in SSD performance, or lack thereof. Do you have criteria you can use to consider an SSD so bad that you don't use it?

Well, um, yeah. I mean, I think it's better these days, but what you see when you run it, even just these graphs, you can tell a lot from them. If you run this benchmark on, say, a Samsung 980 Pro, it's a really clean graph, right? If you run it on... we had a lot of trouble with one particular vendor, maybe not the current ones, but two generations of their drives were really bad in our fleet, and it's just crazy: like two-thirds of the dots are on the bottom. So that's a really easy way to tell which ones are performing really poorly, and if you actually compare these graphs and measures, it's relatively easy to tell how bad or how good a given device is for this type of benchmark. Whether something is acceptable to an organization or a person really depends on their criteria, but this gives you a way to compare them.

You presented this as: you'd prefer to have these parameters configured for iocost on a system. Do you envision having a system that learns these parameters on the fly, or will you always need this type of benchmarking ahead of time?

So, I mean, that'd be ideal, right, if you could just figure it out on the fly. On hard disks it's a little bit easier, it's more feasible to do. If you just put in the system and enable iocost, it goes into auto mode: the parameters are auto, it tries to pick something reasonable, and that does work more or less okay for hard disks, because their queue depth is kind of low. So iocost, I didn't explain this during the talk, but iocost has a self-modulating, adaptive mechanism. What it does is that it has QoS parameters, and if the device misses the QoS parameters, the latency targets, then it slows the device down further, thinking the device is struggling, so slow it down, right? And if it keeps hitting the latency targets, it just kind of inches up, more IO and more IO, until it kind of chokes again. So it has that mechanism, and one signal that gets fed into that mechanism is queue depth depletion. If you try to issue an IO and the device doesn't have an open command slot, it's a clear signal that the device is saturated. The good thing is that on SATA there are only 32 command slots, so you get the signal really quickly. On NVMe there are like 4k, so by the time you get the signal, you are too deep, and that signal becomes kind of useless. So, yeah, I don't know; it'd be really great if we could, but I haven't found a way to get a good signal.
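To make the adaptive mechanism just described a bit more concrete, here is a toy version of the shape of that feedback loop: miss the latency target and throttle harder, keep hitting it and inch back up. This is only an illustration with made-up numbers, not the kernel's actual iocost code, which also folds in signals like queue depth depletion.

    #include <stdio.h>

    /* Stand-in for real telemetry: pretend the device struggles for a few periods. */
    static double measured_p99_ms(int period)
    {
        return (period >= 3 && period <= 5) ? 12.0 : 3.0;
    }

    int main(void)
    {
        double vrate = 100.0;            /* % of the device's modeled capacity we allow */
        const double target_ms = 5.0;    /* hypothetical p99 latency target */

        for (int period = 0; period < 10; period++) {
            double p99 = measured_p99_ms(period);
            if (p99 > target_ms)
                vrate *= 0.9;            /* device is struggling: throttle harder */
            else if (vrate < 150.0)
                vrate *= 1.02;           /* device keeps up: inch back up, capped */
            printf("period %2d: p99=%4.1f ms -> vrate %.1f%%\n", period, p99, vrate);
        }
        return 0;
    }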
Have you investigated whether NVMe SSDs vary over time? Like, when an SSD is brand new and doesn't have any balancing and reallocations, versus an SSD that's been in a machine for seven years; it might have rather different performance characteristics even though they're the same model.

Well, not really. I mean, I'm not saying that that's not an issue; that's definitely an issue. But in the fleet, what we have seen up until now is that there doesn't seem to be a drastic decrease in performance, it stays within range. And another part is that it's difficult to measure and model that, so we haven't done it. Another, kind of anecdotal, thing that I observed is that I ran this benchmark a lot on my SSDs, and it really beats them up, that's a lot of writes, and I killed a few SSDs, if you keep running it for like a hundred times on an SSD. And they all seemed to show similar parameters for a while and then just die. So hopefully that holds for a lot of other devices.

I was just wondering if you keep track of tail latencies, around the 99.99th percentile latency, in your fleet, and whether or not that is controlled better once you're running iocost.

Yeah, I mean, that's one of the key metrics that we watch, so we do watch tail latency pretty closely. And if you look at... I'm not sure I have it... so this is the text output, not the PDF one. If you go down, you see these tables. This is from one run; it shows the latencies that it observed. It does plot the latency chart too, but this has a lot more detail. What it shows (the website has a better description) is, for example, the full P99 distribution over time. It measures P99 latency every second, and then it shows, say, the P99 of that, right? The reason that's interesting is that there's a different failure mode for devices. For example, let's say you run a given workload for an hour and your P99 is, say, 100 milliseconds, and that's many, many gigabytes of IO. There's a difference between that one percentile being distributed evenly and that one percentile happening in, say, a 30-second time window. The first case is fine; the second case, you're going to notice it. So what you can see is that if you look at, say, the P50 and you follow it down to the P99, what it tells you is that the median latency over a second can get that bad. So it gives you a lot of insight into how the SSD is actually behaving. You don't want devices which show high median tail latency, right?
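The "P99 of the per-second P99" idea is simple to compute. Here is a small, self-contained illustration on synthetic data; the real resctl-bench report is of course far more detailed and uses real latency samples.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* p in [0,1]; sorts v in place and returns the requested percentile. */
    static double percentile(double *v, int n, double p)
    {
        qsort(v, n, sizeof(*v), cmp_double);
        return v[(int)(p * (n - 1))];
    }

    int main(void)
    {
        enum { WINDOWS = 60, SAMPLES = 1000 };
        double per_second_p99[WINDOWS];

        for (int w = 0; w < WINDOWS; w++) {
            double lat[SAMPLES];
            for (int i = 0; i < SAMPLES; i++)
                lat[i] = (w == 30) ? 50.0 + rand() % 100   /* one bad second */
                                   : 1.0 + rand() % 5;     /* otherwise quiet */
            per_second_p99[w] = percentile(lat, SAMPLES, 0.99);
        }
        /* A single bad second dominates this metric even though the overall
         * p99 across all samples would look fairly tame. */
        printf("p99 of per-second p99: %.1f ms\n",
               percentile(per_second_p99, WINDOWS, 0.99));
        return 0;
    }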
Something sort of related to two of the previous questions: in your graph of performance, you had two drives with really beautiful graphs and one with a really dramatically changing graph. I'm wondering, if you observe that a drive has really variable performance, whether on such a variable-performance device the adaptive iocost actually works better.

So, to clarify, all four graphs were from one device; they're different aspects. But what you said is still true: for some devices you get a really nice clean graph, and for some other device you just get a cloud. And in practice, I don't know. The thing is, it depends on how cloudy the device is, and it's also kind of a limitation: the extent of the experiments we have done is a little bit limited. Up until now, when we get these cloudy devices, we just kind of throttle them down far enough so that the data points we get there are somewhat acceptable. So that's what we have done. I'm not sure how well they would do with the adaptive thing. I'm a little skeptical. The reason why I'm skeptical is that if you think about the adaptive mechanism, the time scale it works on is in seconds, or 100 milliseconds, something like that; it observes each period and adjusts adaptively. And when SSDs misbehave, it's on a minutes-or-hours timescale, right? It chokes every ten minutes, say. There's no way the adaptive mechanism can capture that, so there's kind of this gap where you need something more.

And then my other question is: is it possible to use AI for IO scheduling?

I mean, we tried to use machine learning there, but it didn't really go anywhere.

First, for the same SSD, do you also take into account the over-provisioning? And second, do you think the vendors would agree on running these benchmarks and publishing the results?

The first one: we don't. I mean, the other side of the answer is that the test is kind of limited. It doesn't use a lot of the disk, it only uses a kind of smaller portion of the disk, so over-provisioning shouldn't really affect the performance that we see. It doesn't test a scenario where the device is 80 or 90 percent filled up; it just assumes that the device has enough space available. So we don't test that right now. The second one: internally, I think we can ask vendors to submit results, but I'm not sure they would agree to publishing them. I was more hoping that people will just run it and publish the results; what are the vendors going to do?

I just wanted to ask whether you have any future plans to develop iocost any further.

Do I have any future plans to develop iocost any further? I'm generally pretty happy with how it behaves, and in our fleet it seems to work well. I want to see wider adoption; more people using it would be exciting, and the biggest challenge there is that it's just painful to configure. So what I really want to see, in terms of core features, is probably nothing new right now, but getting the parameter configuration figured out and more widely available.

You did show the deficiencies in the current mechanism, so two questions here. First, do you have any thoughts about how to make useful benchmarking that doesn't take six-plus hours? And second, it's very interesting that you can use this to understand, from your fleet's perspective, what are good or bad SSDs, to then eventually raise the quality without spending excessive amounts of money, which is the whole principle. But this is from the perspective of someone who's maybe looking into building a desktop, or trying to evaluate the built-in SSDs being used in laptops or whatever. How do you see this fitting into that? Because I think that's really where a lot of people would be motivated to do this in the crowdsourced way, right?

Right. So here's the problem with, I think, resource control as a whole: cgroups, CPU, IO, memory and all these things. On your personal devices, maybe not on the phone, but on your laptops, you usually have more than enough resources, right?
So nothing is really contending for anything, right? That's the case for most people. So there's usually not a pressing need to get resource control figured out so that my bash shell is never affected while Connor is compiling something; people don't really do that, right? I don't know about you, but I'm doing a lot of that.

Oh yeah, actually, right, right. In all seriousness, yes, I do compiling in VMs and whatever, but it's actually a real problem, IO performance, for playing games, because games have gotten so bloody big. We're now hitting the same kind of IO issues that you tend to see with database workloads and fun stuff like that, which is a horrifying thing to think about in the first place, because that's the part of me being a person in the industry and enterprise that I really don't want to bring into the home. But video games have made it a thing.

Okay, so for the most part, when people play games, that's the whole thing the machine is doing. But here's what I think can be interesting: when you buy an SSD, don't you want to know whether this thing is consistent or not? I mean, if it's going to stall for like 20 seconds every, I don't know, two hours, don't you want to know that? So I think in that sense it can be really useful. If we build a database... I'm not sure we have that right now, but if you go to the repo, there's a database directory, you go into that and there's a models directory, and if you enter that there are the JSON files. We should format it and show what the actual results are in an easier format, but that would be a really good way to compare: how is this SSD? It's not just maximum MB/s or whatever, right, it's the latency and things like that.

Yeah, this is how I would be interested in seeing it, which is why: is there maybe a way we could have a benchmarking profile that's not as intensive, doesn't take as long, but can still provide useful output and results? Because from that context it becomes interesting to do things like, well, maybe YouTubers and whoever else is reviewing computers can run these things and contribute these benchmarks, Phoronix could include it in their PTS, things like that, where it's not completely asinine to run the whole thing.

Yeah, please, let's not... I don't know, just not YouTube; YouTube cuts you off at 13 hours, I learned that. Yeah, but one easy way to kind of tune down the benchmark, and you can do that with the parameters, is that you can just tell it to collect a smaller number of data points instead of collecting the full set, and if five people each collect a portion, it's as good as one full run. So that's one way to go about it. The only thing there, I think, is that you just need a lot of data points to make the graph fitting higher quality, and then run the statistics. So we would still need that.

To truly understand the edge-case performance of the drives, wouldn't you need to write through the drive twice and then do overlapping IOs in random patterns, to kind of force the drive into its ultimate GC bottleneck? In which case what you're seeing is characterizing the performance of the GC algorithm on the drive itself. Also, the other question I had is: how do you compensate,
I guess, would it be possible to compensate over time, as bad blocks accumulate and the characteristics of the GC algorithm change, and tune these parameters? So that when the drives age, you're able to change the iocost model.

Yeah, I think that would be possible. It's not doing that right now. One challenge is that all these kinds of specific behaviors can be wildly different depending on the specific vendor, the firmware version, different generations of devices. So it's really difficult to come up with something precise. The approach that resctl-bench took was to just bring a really big hammer: can you survive for eight hours? And we will take the number at your worst. That's the approach it took. I'm not saying that's the ideal one, but yeah, it definitely can be improved.

See, that's where we need the AI to come in.

So, something that was just said made me wonder about this. You said that this runs for six hours or so. What percentage of the total capacity of the drive is being written during that period?

Not big. It's fairly small, some tens of gigabytes. But the thing is that these areas are being written over and over again, and the way most SSDs behave, they just keep accumulating, right, and then GC kicks in.

Well, that's where I was going. In other words, the total number of bytes written is typically larger than the capacity of the drive, forgetting about the actual block sectors you're talking about, which are really irrelevant on an SSD. So you do force it into GC.

Yes. Okay, that's my point. Because I would assume that the stalling behavior, or whatever else weird that's going on, has to be related, one would hope, to that and not just the firmware crashing.

You know, I forgot the count, but it ends up writing the whole capacity multiple times over in that time period. That's why it kills the SSDs if you run it too many times. I mean, it doesn't kill them immediately, but if you run it like a hundred or two hundred times, you might run up against the lifetime.

Yeah, so you're sending it to its early grave, but not right away. Thank you so much.

Okay. Does it work? Yeah, seems like it. Okay, so hi everyone, I guess it's time to start. My name is Quentin, I work at Meta, and I'm gonna talk for the next, maybe not full, hour about bpfilter, which is a mechanism for packet filtering using BPF, and we'll see how we can leverage that for iptables and nftables. So, as I said just before, my name's Quentin, I work at Meta in the Linux user space team. We aim to contribute to open source projects such as systemd, package management, and also bpfilter, which we will discuss during this talk. I've been working on this project for more than a year now, and I'm happy to present it to you.

First of all, we've got to go back a bit. Not the whole talk will be about iptables, don't be afraid, but just to give you some context and a refresher about iptables and how it works. iptables is a quite old tool, it's from 1998, and it's a packet filtering mechanism which was, up to some point, the de facto standard on Linux. The way iptables is structured is quite simple.
We have the schema on the right. Basically, you have different tables, which can be nat, mangle, filter, and the table defines what kind of processing you want to do with iptables. For this talk we'll focus on filtering, which is filtering packets, basically. So we have the filter table, and inside that, sorry, my slide keeps moving, okay. Inside the filter table we have different chains. The chains define where you want to filter inside the network stack; for this example we've got the input, forward, and output chains. And in each chain you can have one or more rules, and the rules define a set of criteria: when the criteria match a packet, the action is applied. The first one we can see at the bottom here is: if the packet is ICMP, then we drop it. As I said, we'll focus on the filter table for now.

When it comes to iptables, the iptables defined in 1998 is slightly different from the one you can use these days. Who knows about iptables-legacy? Yeah. So iptables-legacy is basically the iptables from 1998, which communicates with the kernel using the getsockopt and setsockopt syscalls. Syscalls: is that clear? Does everyone know what a syscall is? Who knows what a syscall is but would like me to explain for the other people in the room? Okay. A syscall is basically a way for a user space program to communicate with the kernel, and to have the kernel perform a privileged task. So iptables, the binary running in user space, will call into the kernel through getsockopt or setsockopt to perform something on the kernel side.

Now that we have the basics about iptables, let's see how to use it. The workflow is quite simple. You use iptables, you call the... sorry, I just realized, to finish explaining the syscalls: iptables-legacy uses getsockopt and setsockopt, and the iptables you can use nowadays on your computer uses netlink to communicate with the kernel. That was the difference, my bad. So the workflow is pretty straightforward: you call the iptables binary on your machine, you give it some parameters, and if you want, for example, to create a new filtering rule, iptables will first use getsockopt to get the whole rule set from the kernel, modify it, and send it back using setsockopt. That's a quite heavy process: if you have to insert a thousand rules, it's going to use getsockopt a thousand times, get the whole rule set, modify it, send it back. It can be quite inefficient for huge rule sets. And the data contained in the getsockopt or setsockopt call is binary; it's a specific structure defined in the iptables sources.
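For a rough feel of that interface, here is a minimal sketch of how a legacy-style client asks the kernel about the filter table with getsockopt. A real client would follow up with IPT_SO_GET_ENTRIES to fetch the rule blob, edit it in user space, and push the whole thing back with setsockopt(IPT_SO_SET_REPLACE). Error handling is trimmed, and it needs root (CAP_NET_ADMIN) to run.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/netfilter_ipv4/ip_tables.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);  /* same kind of socket iptables opens */
        struct ipt_getinfo info = { 0 };
        socklen_t len = sizeof(info);

        strcpy(info.name, "filter");                     /* which table we are asking about */
        if (s >= 0 && getsockopt(s, IPPROTO_IP, IPT_SO_GET_INFO, &info, &len) == 0)
            printf("table %s: %u entries, %u bytes of rule data\n",
                   info.name, info.num_entries, info.size);
        else
            perror("getsockopt");
        return 0;
    }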
So, what's wrong with it now? Well, it's quite old. It was designed in 1998, and nowadays the way we manage networks, the kind of traffic we have, the bandwidth we use, is completely different. And quickly, iptables, if you have a lot and a lot of rules, can become a bottleneck when it comes to handling the traffic on your host. As an example, if you have, let's say, 128 rules to filter different IPs, then it starts to slow down and you won't get the full bandwidth; it depends on your host, obviously. So, yeah, iptables is showing some limitations when it comes to modern usage of the network. And the thing is, iptables has been, for a long time, the standard way to filter packets. It's well documented, a lot of people use it, and when you have something that works, you don't really want to move to something different, like the shiny new stuff. So what can we do to improve iptables, make it faster, while not forcing people to move on?

Well, that's where BPF comes into play. Is everyone aware of what BPF is? Who doesn't know? Who knows, but wants me to explain because of other people in the room? Okay. BPF is a way to run, inside the kernel, user-supplied code, basically. You write a BPF program and you can load it in the kernel, and it runs within the kernel context, in a secure environment. The program you load and run in the kernel is verified against some specific constraints, and then it runs within the kernel itself. Originally BPF, and the one we use nowadays is actually called eBPF, was designed for packet filtering. eBPF was created by Alexei Starovoitov in 2014, and it can be used to perform very efficient packet filtering, so we'll try to use those specifics to improve iptables.

Now, how can we leverage it? Let's start by defining a new kernel module; let's call it bpfilter, for example. That module is a bit specific: it's not like a normal kernel module, it's a UMH. A UMH, a user mode helper, is a kind of module that will start a user space process from within the kernel. That has a lot of different benefits. It's not used much anymore, and I'm not sure it ever was a lot, but bpfilter used it. So the module is loaded into the kernel, a new user space process is started from that module, and the benefit is that you can use your user space development tools to work on that user space process: you can attach a debugger, you can attach a profiler, the same way and with the same tools that you would use in user space. And the good side is, if you're doing something wrong and your process crashes, the kernel doesn't really care about it; it's not making the kernel unstable.

So when we've got that, what we can do now is modify the kernel, basically modify the ip_sockglue.c file. That file is responsible for mapping a getsockopt or setsockopt call to a specific function. What we want to do is modify that file so the iptables getsockopt and setsockopt calls go to our new module, and not where they would normally go. And the final part, which is the biggest one: at this point we are able to hijack the iptables calls coming into the kernel and send those to the module, but now we need to do something with them.
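To ground what "a BPF program" means before we get to the conversion step: it is ordinary, restricted C, compiled with clang to BPF bytecode and checked by the in-kernel verifier when it is loaded. A minimal example, unrelated to bpfilter's generated code, that attaches to XDP and lets every packet through might look like this.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Compile with: clang -O2 -g -target bpf -c xdp_pass.c -o xdp_pass.o */
    SEC("xdp")
    int xdp_pass_all(struct xdp_md *ctx)
    {
        return XDP_PASS;   /* verdict: hand the packet on to the normal stack */
    }

    char LICENSE[] SEC("license") = "GPL";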
Yeah, no, and it's quite a high-level view. So, our module receives the content, the payload, of the getsockopt and setsockopt calls, and now what we want to do is simply use that payload, convert it into BPF instructions, and load it into the kernel. That's not the easiest step, but when you do that, it works.

So that was the theory when the project started, a long time ago. And from there, there were a few patches submitted to the kernel. The first ones were from Alexei Starovoitov, David Miller, and Daniel Borkmann, who are kernel maintainers from different subsystems, and they merged the basic boilerplate of it, which is creating the process, having the kernel module, and modifying ip_sockglue.c to send the payload to the bpfilter module. Eventually Dmitrii Banshchikov worked on it too and added the whole capability to convert the payload into BPF bytecode. He submitted a couple of patch series, and none of them were merged, and he moved on. Eventually I took over the project and tried to submit a v3. The issue is that we quickly realized that having bpfilter defined this way, as a user mode helper in the kernel within a module, can lead to some issues. For example, you would only know about iptables, and not nftables or other ways to describe filtering rules. And also, it's tightly tied to the way the kernel is developed. The issue with that is that, because bpfilter is a user space process, it's user space code, and the kernel maintainers don't have much time to review code in the first place; when it's user space code it's even harder, especially for a new project that is developing quite fast.

So we decided to move it to user space, and what it looks like now is basically this. You have user space and kernel space; we can see bpfilter in orange, sorry for the color-blind. The client will be iptables or nftables, for example, and that client is linked to libbpfilter. The library is a small, lightweight library used simply to interface with the client and to easily communicate with the daemon. And the daemon is responsible for the whole heavy lifting: understanding the client's data, the set of rules, converting it into a program, and also loading the BPF programs into the kernel at the different hooks.
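As a sketch of what the thin client side of that picture could look like: connect to the daemon over a Unix domain socket, which is how the daemon is reached as described below, and ship the front end's payload along with a little metadata. The socket path and the message layout here are hypothetical, not bpfilter's actual wire format.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    struct request_hdr {
        uint32_t front_end;    /* e.g. 0 = iptables, 1 = nftables (made-up encoding) */
        uint32_t payload_len;  /* number of payload bytes that follow */
    };

    int send_ruleset(const void *payload, uint32_t len)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        struct request_hdr hdr = { .front_end = 1, .payload_len = len };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        strncpy(addr.sun_path, "/run/bpfilter.sock", sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            write(fd, &hdr, sizeof(hdr)) < 0 ||     /* metadata: who is talking, how much data */
            write(fd, payload, len) < 0) {          /* the client's bytecode/rule payload */
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }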
So we talked about iptables; let's talk a bit about nftables now, because it's going to be relevant at this point. iptables works differently from nftables. iptables, sorry, sends the whole rule set into the kernel, and the kernel interprets the rules: for each packet it goes through each rule, checks what matches or not, and acts on it. nftables is different, because nftables relies on a VM running inside the kernel. So what nft, the user space tool for nftables, does is take the content of the command line, convert it to netfilter bytecode, and send it to the kernel. That's what we can see here in orange; it's what goes over the netlink socket from nftables to the kernel. If we take some time we can understand what's going on here; it's very close to assembly-like bytecode, we see kind of the same words and such. So yeah, that's one of the main differences: one uses bytecode executed inside a VM in the kernel and the other one doesn't, and nft uses netlink.

So if we go back to bpfilter now, with our nft example: libbpfilter is linked to nftables, the nft binary, and nft gives it the bytecode. Basically, what it would normally send to the kernel is not sent over the netlink socket; it's given to libbpfilter instead. And libbpfilter won't do any manipulation of the data; it doesn't do much except send it over a Unix domain socket to the daemon, which we see at the bottom of the slide. So it takes the bytecode, puts all of that into a message with some metadata (where the message is coming from, what kind of data is inside) and sends it to the daemon. Because of this we can have a very lightweight and easy-to-integrate libbpfilter, and that was the point of it.

So now the daemon is receiving the data. At the top of the slide we can see the data coming from libbpfilter, from a client, and the daemon has different parts inside of it. The translation part is responsible for converting the client-specific data, so nftables or iptables data, into a generic format used within the daemon itself. It's not meant to be used outside; it's just internal stuff. When the translation is done, we can go to the generation, which creates the BPF programs, and when we have the programs we can load them into the kernel and they will start filtering packets. The translation is specific to the client: if we use nftables, there is what's called a front end within bpfilter that receives the netfilter bytecode and converts it into the format used within bpfilter. Having that architecture is quite useful because, when we have the generic data at the bottom and we want to create the BPF programs, we can use the same functions wherever the data is coming from. It makes it easy to back up the rules and save them somewhere, to save the state of the daemon, and it's also a way to later be able to optimize the rule sets. We can imagine, for example, that if you define a hundred rules to filter a hundred different IPs, bpfilter would be able to put all those rules into a set and filter on the set directly, instead of having a hundred rules defined in the BPF program. So by converting the client-specific format into a generic format, we have a lot of different things possible later.
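Purely as an illustration of that "front end translates into one internal format" idea; none of these names or types exist in bpfilter, they only sketch the shape of such a layer.

    #include <stddef.h>
    #include <stdint.h>

    enum fe_type    { FE_IPTABLES, FE_NFTABLES };        /* which front end sent the payload */
    enum fe_verdict { VERDICT_ACCEPT, VERDICT_DROP };

    struct generic_rule {                                /* hypothetical daemon-internal format */
        uint8_t         l4_protocol;                     /* e.g. 1 == ICMP */
        enum fe_verdict verdict;
    };

    /* Each front end would own a parser with this shape; generation and loading
     * only ever see struct generic_rule, whichever client produced the payload. */
    typedef int (*fe_translate_fn)(const void *payload, size_t len,
                                   struct generic_rule *out, size_t max_rules);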
So when we have the generic format here, we can start creating the BPF program. I'm not sure if you can see it from the back, so I'm sorry about that. We have the generic rules, in bpfilter's internal format, and we can start creating the various BPF programs that will do the same processing as the rules. That's basically a compilation step: we go from a specific format into BPF assembly, BPF instructions.

It starts with what's called the prologue. bpfilter is able to generate different types of BPF programs, whether it's XDP, TC, or BPF netfilter, which is newer. And because you want to generate the rules the same way whatever the program type is, you have a prologue at the beginning to ensure that the arguments and the context are set up the same way for the different program types. For those who don't know BPF that much: different program types allow you to attach to different locations in the kernel, but those programs will have different arguments when run, so we need to start from those different arguments and ensure that what we work on is the same for every program type. When we've done the prologue, we can start setting up the environment: setting up the context, creating a dynamic pointer to access the packet data and avoid reading outside the packet, for example. We continue with unrolling the rules: if we have ten rules, for example, we convert all of those rules into BPF bytecode. One of the last steps is to create the policy. The policy is the default rule for a chain; the policy says, for example, if no rule matches, then drop every incoming packet. Then, bpfilter is able to generate custom BPF functions: at the end of program generation it can create new functions that can be called from the main program. That's useful to avoid duplicating code and to ensure that the generated BPF bytecode is as small as possible. And finally the epilogue, which is basically the same thing as the prologue but for the return code: XDP and TC, for example, have different values meaning we accept or we drop the packet, so we need to be sure that if we want to accept it, we put the right value in the return register for XDP or for TC.

When the whole generation step is complete, we can see, if it's not too small to read, that there is now a program in the structure returned by that step. So we now have our program available, and we can do the last step, which is loading the program and attaching it to a specific hook. So, sorry, yeah, bpfilter uses the BPF subsystem in the kernel to load and attach the programs. There is one program for each interface, except for the loopback one. This allows you to filter per interface without having a single program which would have to check which interface the packet is coming from; that saves us some branching if you use a lot of per-interface filtering. And the program is replaced atomically: if you have a program already running for a specific hook, for your front end, so if you created a filtering rule using nftables on the pre-routing hook, for example, and then you want to add another rule, then that program is updated, and it's done atomically to avoid any downtime in the filtering.
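Mechanically, "generating and loading BPF instructions" means filling an array of struct bpf_insn and handing it to the kernel. A minimal sketch of that, assuming libbpf is available: this two-instruction program only returns XDP_DROP, whereas bpfilter's real code generator of course emits the full prologue, per-rule blocks, policy, and epilogue.

    #include <linux/bpf.h>
    #include <bpf/bpf.h>          /* libbpf's bpf_prog_load() wrapper; needs root/CAP_BPF */

    int load_tiny_drop_prog(void)
    {
        struct bpf_insn insns[] = {
            /* r0 = XDP_DROP: the epilogue picks the right verdict for the program type */
            { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = XDP_DROP },
            /* exit */
            { .code = BPF_JMP | BPF_EXIT },
        };

        /* Returns a program fd on success; attaching it to an interface is a
         * separate step (e.g. bpf_xdp_attach() in libbpf). */
        return bpf_prog_load(BPF_PROG_TYPE_XDP, "tiny_drop", "GPL",
                             insns, sizeof(insns) / sizeof(insns[0]), NULL);
    }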
So, I've got a demonstration if you want to see it. Okay, is it big enough? No, I can't see what I'm doing right now. Yeah, I see that. What is going on? Okay, my keyboard was in French, my life. Okay, that's better. Skip my test. Okay, is it better in the back, for the right window? Okay. So I should... I'll be honest, I don't see anything. Let me just do something quickly. Okay, I can see it now.

So, let's start. First, let's start bpfilter, the daemon; we need to start it, right? We see a few options here. The first one is transient: for those who have used systemd, maybe that rings a bell. Basically, what this does is tell the daemon not to keep any file on the file system; that's the short explanation. The long one is that bpfilter, by default, will pin the BPF programs to the BPF file system; that allows the daemon to be restarted without losing the filtering. The transient option says: well, don't pin anything, and if we stop the daemon, then we remove the programs. I guess the verbose one is quite explicit, and the last one is to disable iptables support, so in that case we would just have nftables support.

So the daemon is started, it's set up for nftables. That's right, so the daemon is now running in a VM, and what we're going to do, I've got the terminal here on the bottom left trying to ping my VM, with bpfilter running in the VM, is add a rule to filter out ICMP packets within the VM. So I should... yeah, so I don't have to remember the whole commands, okay. We just started the daemon, so there's nothing here: if we list the content of the nftables rule set, as we can see, there is nothing, there is no table.

Now what we want to do is add a chain. A few things happen here. First of all, we create the chain; we want to call it pre-routing; we add it to our table; it's a chain of type filter, whatever, whatever; and we can see a default policy as the last argument, so by default the chain will accept every packet. Another thing here: we can see the --bpf option. That option is from a fork of nftables I've got, and it means: use bpfilter. So instead of sending the data to the kernel with netlink, it just sends the data to the bpfilter daemon. And on the right side, what we can see is the content of a codegen structure, which is the structure that goes through all the steps of translation, generation, and loading. It now contains a default policy, which is the one we set when creating the chain, and it also has two programs, one for each interface; because I've got two interfaces I've got two programs, not counting the loopback one. And we can see the program has a name, it has a map attached which contains metadata (each program created and loaded will have a map which contains metadata), and we can see it's not pinned anywhere; that's because of the transient option.

Now we want to add a rule to it. That rule will tell nftables to drop every ICMP packet, and we see the counter option; the counter option tells nftables to create counters for the number of packets and bytes matched by the rule. Well, it doesn't crash, so I guess it works, right? And we can see here there are no packets; there is no ping working anymore. So we actually catch the ping and drop it. But that's not just it, right, I want to see what's happening now. If we dump the rule set we can see there is a new rule, and we can see that it's actually updating: the number of packets and bytes filtered out is increasing. And because bpfilter creates BPF programs like any others, we can use bpftool to see what's going on. We see two different programs, which have the same names as the names found in the log on the right, and if we want to have a look, let's see what's inside one of them.
So that's the BPF bytecode. I just printed the content of one of the programs used to filter, and that's basically what bpfilter is creating. Each instruction here is a structure, and bpfilter fills a memory buffer with those structures to have the proper program in the end. And if we look closely, we can see this part here: this part is the rule we just added, to drop ICMP packets. It starts at line 87, and what it does is load the protocol field from the packet. Once it's got the protocol field in register 4, it checks whether it's the value 1, which is the ICMP protocol. If it's not equal, it just jumps to the next rule; if it's equal, it continues and calls the custom function which updates the counters. It then puts the value which means drop, for an XDP program, into the return register, and exits the program. And because the BPF subsystem sees that the return value of the program is 1, and because it's an XDP program, it knows that it needs to drop the packet.
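For reference, the same logic written by hand as a C XDP program looks like the following: check that the packet is IPv4, load the protocol field, and return XDP_DROP if it is ICMP (protocol 1). This is only an equivalent illustration; bpfilter emits the instructions directly and also bumps its per-rule counters, which are omitted here.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int drop_icmp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *ip;

        if ((void *)(eth + 1) > data_end)           /* bounds check for the verifier */
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))    /* only look at IPv4 */
            return XDP_PASS;

        ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        if (ip->protocol == IPPROTO_ICMP)           /* the rule matches: drop (returns 1) */
            return XDP_DROP;

        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";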
Any questions on this part, before I go back to the slides? So that was the demonstration; let's talk about performance a bit. Before discussing any further, I must say that, first of all, it's hard to do benchmarks, to get meaningful values from a benchmark is not easy, and the other thing is that this one is very much synthetic. What we do here, basically, is we have two hosts connected with a 10 gig link, so we can go up to 10 gigabits between the hosts. The first one uses pktgen, the kernel module, to generate fake traffic up to 10 gigabits per second, and the other one just receives it and has a rule to drop all the incoming traffic on the interface. And what we do is add more and more rules before that dropping rule; the rules before are just useless, we just want to go through them and see the overhead of every rule. But, as I said, you need to take this one with a grain of salt, and it's synthetic because you wouldn't actually do this. Maybe you would do this with iptables, but not with nftables: instead of having a pile of rules to filter a bunch of IPs, you would have one rule to filter on a set of IPs. I've used this one anyway, because bpfilter doesn't yet support sets, so we can't say: here are five IPs, and you need to filter on all of them. I wanted to compare bpfilter and nftables apples to apples, so that was the only way to do it for now.

And that's giving us some values. We can see that nftables and iptables drop in bandwidth earlier than bpfilter. What we are doing here is, for iptables and nftables, we add the rules to the pre-routing hook, and bpfilter is adding the rules, creating the program, attached to the XDP hook, and that explains the difference. Because we attach the program to XDP, the packet is filtered as soon as it arrives on the NIC, and the kernel doesn't have to allocate any memory for it. For iptables and nftables, the hook is located later, so at that point the kernel has had to go through (well, not routing, because it's pre-routing) allocating memory to store the packet and going through the kernel's network stack. Another thing that's interesting here is that, by default, even with just one rule, which is "drop every packet", bpfilter is faster than iptables and nftables, and the difference here, over 10 gigabits, is around 200 megabits. That difference is, again, due to XDP: because we don't have to allocate memory for the packet, we can just drop it as soon as it arrives.

So, now that we've talked about it a bit more, what can it actually do right now? For now, I've got two forks, one of iptables and one of nftables, and those two forks are able to use bpfilter with a --bpf flag. For iptables (iptables-legacy) and nftables, it's able to filter packets based on the source or destination IP, same for the ports, and on the protocol, as we've seen just before; you can also filter on the source interface, and it can collect statistics from the various rules defined. It can create XDP, TC, and BPF netfilter program types; the BPF netfilter program type is quite new, it's from last April, and it allows a BPF program to be attached to the same hooks as iptables. And finally, the programs it defines use kfuncs and BPF helpers, they can create custom functions, and they use dynamic pointers to avoid reading the packets directly.

And what's coming next? IPv6: we're coming back to the IPv4 question from before, I'm working on support for IPv6 right now. Support for sets, so I would be able to compare nftables to bpfilter in a more useful way, let's say. Also partial generation of the rules, to avoid regenerating the whole rule set every time. It's quite efficient already, it's not like you have to wait for the generation to be done, but it would be better not to have to do that every time: if we made a rule and we translated that rule once, we don't need to do it again, we should just store it and reuse it later. I'd like to add a generic client, to not be constrained by iptables or nftables, to be able to use any hook you want and any feature you want. And finally, cgroup support, to attach programs to cgroups directly.

Some links if you want to have a look: there is the bpfilter repository, the forks for nftables and iptables; I'll try to post some status reports and updates on the project on my website; and finally, if you've got questions, I've got my email here. Do you have any questions?

This is really cool, and I'm curious about the hurdles that you're facing to get the forks of iptables and nftables, your changes, upstream. Are there any roadblocks that are making it difficult
to get that stuff upstream?

Yeah, to get it upstreamed. So I'm working with the nftables people. I've discussed with Florian Westphal, who is a maintainer of nftables, netfilter, and he was interested in it for XDP offloading: being able to use bpfilter transparently from nft to put your rules inside an XDP BPF program, which would run earlier than anything in nftables. And the benefit of it is quite transparent: you wouldn't have to allocate memory for the packet, and if you have to mitigate traffic attacks and that kind of thing, you would drop the packets as early as possible. So it's in progress, I would say. Okay, thank you.

If you have set up one of these BPF programs to do filtering and you have existing nftables rules, will they layer, or will it replace them?

So for now they exist together, basically. If you create a rule without the --bpf option, you update nftables; if you add --bpf, you create a BPF program. The nftables integration should be better, and hopefully will be, working with the nftables people; for now it's not a proof of concept, but it's an idea of how it could integrate with it.

Why exactly do you need the daemon separately, rather than just having the nft program call into a library that can do all the generation and then load it straight into the kernel? Because you pretty much have to run nft as root anyway, so no privilege separation is required for that.

The issue is that you need to keep state. So when I, let's go back to... okay, so this part. When I receive the netfilter bytecode, I need to translate it into the internal format, and I don't want to do it the other way around, so I keep it aside and I update it if needed, but I don't want to just recreate it. So I save it in the context of the daemon. And it's even worse with iptables-legacy, because the format is very, very specific, and it's difficult to go back from BPF to that format. So that's a way to save time and effort, and not to pull my hair out trying to go back the other way. Another solution would have been to have just a library and to create a BPF map to store that kind of data, but it's not made for that. So that was the best solution I could find.

Okay. I mean, another option might be to have an on-disk cache area that the library just knows where to look for.

Yeah, but then, where is the right way to put the cache? Is it meaningful to have the cache, and the data stored in the cache itself? I don't know if it's a better solution; that's the one I picked, and it works fine for me.

What's XDP?

XDP is, in that case, the hook located on the network interface. When you attach a program to XDP, you attach it to the network interface directly, so you get the packet as early as possible, while it's still in the buffer of the network card.

Thank you very much. Thank you.