So, hello, my name is Svendik, I'm a software engineer at Red Hat, I'm part of the Anaconda team, and I've been the lead developer of the Anaconda backend. So welcome to my presentation about the story of the Anaconda backend. Let's start with a simple question: what is Anaconda? Anaconda is the system installer used by Fedora, RHEL and other operating systems based on them, and it basically does three things. Usually when you want to install a system, you download an ISO, put it on a USB stick and boot it, and what happens first is that Anaconda applies the runtime configuration — it modifies the installation environment based on some preferences and detected values. Then it collects the user's preferences about the final system: you can select the language you would like to see on your system, the time zone, you can choose specific software to install, define the partitioning, and create user accounts. We verify that all of this is valid and that the new system should be bootable and usable, and only after that do we allow you to start the system installation. This is important: while we collect the user preferences, we don't touch any of your disks and we don't modify your data, but once you confirm that you want to start the installation, that's where all the magic happens and we use your preferences to do real actions on your hardware.

So quickly, how does it look? The runtime configuration happens in very early stages; you usually don't even see that screen, because it goes by very quickly. Then eventually you end up on something like this, where you can specify what you want to do. As I said, nothing really happens yet — we just collect your data, and when we have everything we need, we enable the Begin Installation button. And then we actually perform the real actions. I mention this because Anaconda is difficult precisely because of this part, the part where we just collect stuff but cannot actually do anything yet: we have to simulate what is going to happen and how the system is going to look, and that's not easy. Also, you can visit and configure anything in a pretty random order — you can start with the user account, then specify the software, then the partitioning — but in reality the installation then starts with the storage, downloads the software, and only then creates the configuration. So yeah, this is the difficult part.

So what is the Anaconda modernization? It's an initiative that started in 2017, and the idea was pretty simple. We started with something like this: one huge monolithic Python application that had everything in one process. The user interface had access to all the data, whether it needed it or not, everything could use anything, it was all very interconnected, and it was just a huge mess. The idea was to separate the data and business logic into D-Bus services, so that the D-Bus services can talk to each other and the user interface talks to the D-Bus services. What you gain with this is that eventually you can replace the user interface with something that's not even based on Python — for example the web-based UI, which you might have heard about this year. So yeah, the idea of the web UI is actually very old, and it started here, because it was the ultimate goal to get there.
So that was the plan; what was the reality? This is kind of what you have: Anaconda is a very old project, and it's not pretty. There are no nice boxes that you could just take and move into a service; you have to create those nice boxes first. So you just start somewhere. Let's say you are able to identify a piece of code that looks reasonably isolated, and maybe you could move it onto D-Bus. The next thing you have to check is what data it uses, and if there is anything that's not on D-Bus yet, you can't really touch that logic, because you would be missing the data. So instead you target the data, see where it's used, and migrate the data first — make sure that all the code that uses it goes through the D-Bus API to read and change it — and only then can you go back to the piece of code you wanted to move and move it. So based on that, we targeted the data. Anaconda had a lot of weird data objects, for some reason. We had a lot of global variables, which was lovely, but we eventually got rid of them. Another weird object was the kickstart data. Anaconda has a special mode where you can run the installation automatically using a kickstart file, and the kickstart data is a Python representation of that kickstart file. But the kickstart data was also used for interactive installations, because we used it as the main holder of your preferences. That doesn't make sense, because kickstart doesn't support everything that Anaconda does, so we stretched it well beyond its purpose, and that caused a lot of issues. Another funny object we have is the storage model. As I said, we have to somehow simulate the actions first, so the storage model is a Python representation of your device tree: we perform actions on this device tree, check the result, and when we are happy, we actually apply the changes to your real storage. Unfortunately, our storage model didn't have an undo button, but sometimes you had to reset it because you ended up with a completely invalid model. And this object was already propagated into all corners of the UI, so you couldn't just throw it away and create a new one — you had to somehow reset it in place. That was another issue we solved later during the modernization. Then we had the payload object — by payload we mean all the support needed for installing software onto your system. And there is a special category of product data, where every product can have slightly different defaults that it wants to show users, so that's also something we need to take into account. The most problematic part was actually the kickstart data, so we started the planning around that. We collected all the kickstart commands supported by Anaconda and split them into areas that made sense, and this is basically the foundation of the D-Bus modules. Suddenly you have something to work on, with very clear goals: you go module by module, command by command, and figure out how to handle each command via a D-Bus module. So this was the plan. Phase one targeted the system installation — everything you need to actually finish the installation — and it was finished in April, so yay, we are done with that one. The second phase targets the runtime configuration, and it's about the runtime module and the Boss module.
The Boss module is kind of special, because its main purpose is to orchestrate the other modules, send them data and collect data from them; basically it oversees the whole D-Bus API. So what kind of challenges did we face? The first question was where to develop this code, and we had two options. The first was to have a development branch separate from the production branch. That would be nice because if we made a mistake, it wouldn't affect any critical workflows like Fedora or RHEL. Unfortunately it also means that until you release the thing, you don't get any feedback about what you are doing, so you don't really know if your idea of how it should work actually works in the real world. Another thing we were afraid of was keeping it in sync with the production branch: this project didn't have a very high priority, there were always new features and requests coming in, and we would have continued development on the production branch. We didn't want to lose those features and bug fixes, so we would have had to port everything to the development branch, and we knew that would just be too much work. The other option was to use the production branch directly, and that has a lot of benefits, but unfortunately you can very easily break Fedora and RHEL, and you don't want to do that. So there was a lot of pressure to get it right. We were doing very thorough reviews of pull requests, and we spent most of the time just making sure we didn't forget any use cases before we actually made a change. That was very hard to do, but considering the amount of work we did, I think we didn't mess up too badly. Yeah. So what we had, and have had for all these years, is kind of a hybrid solution: some of the kickstart commands were migrated to the D-Bus modules, but a lot of them were still handled by the user interface. That created a lot of interesting situations and challenges, because you somehow needed to take the kickstart file, tear it into pieces, send them to the right components, collect feedback about possible issues and validation errors, and later collect new pieces of the kickstart file again and generate an output kickstart file. Another thing we had to make sure of was that there was no overlap — that we didn't forget to remove the handling of a kickstart command from the user interface once it was already handled by a D-Bus module. We wrote a lot of unit tests to make sure this was fine. Another challenge was how to develop the D-Bus API quickly and safely. I don't know if you have ever read the D-Bus specification — don't; it's very difficult to understand and grasp. We needed to make sure that what we were writing was right, that we didn't have to spend a lot of time on all the little tweaks and weirdness of the D-Bus API, and that we could focus on the code. We also knew we would develop this in a very iterative way and that there was going to be a lot of refactoring of the D-Bus API. For example, one of the things you have to provide at some point is an XML specification of your D-Bus object, and that's really not something you want to keep rewriting by hand.
So what we actually did: we started with the pydbus library, but we built a lot of functionality around it that simplified many things for us, and eventually we threw the pydbus library away and created a new one — dasbus — which also solved some other issues we had, and we put all the new support we created into it. It's now available on PyPI and in other distributions, so you can use it if you want. Another issue we had was the management of default values. The problem with defaults is that we have a lot of sources for them: Anaconda has some ideas about what the defaults are, then the products have other ideas about what the defaults should be, and then there are kernel arguments and boot options that can override them. We didn't want to propagate all these sources into the D-Bus modules so they could each pick one. Instead we introduced the Anaconda configuration files, which are just text based: in the very early stages we process all the sources, generate a temporary configuration file, and then we start the D-Bus modules. The first thing a D-Bus module does is look for this runtime configuration and use only that, so it doesn't have to care about the other sources. That helped us a lot. The final question was how to test this thing. The main goal was to be able to test the backend with unit tests very easily, and luckily we were able to do that with the libraries we had, so we didn't have to do any weirdness with D-Bus daemons and we didn't have to test everything through the real D-Bus API — we could just create the Python representations of the D-Bus objects and unit test them directly, which simplified a lot of things. We were also focusing on end-to-end testing. Anaconda is very difficult to test, so this was the best effort we could make, and the focus was on kickstart tests. That was actually a good fit, because we were targeting kickstart commands: when we were migrating a kickstart command like autopart and it wasn't covered by these end-to-end tests, we could add new end-to-end tests to make sure it stays covered forever. We spent a lot of time on this, improved the infrastructure a lot, and that was also great. So what's the current situation? As I said, we finished the first phase, and this is the code distribution of our current modules. As you can see, storage consumes most of the code, then payload and network, and the other modules are pretty small. So you can guess that we spent years on the storage development, more years on the payload module, and then the network and the other stuff were comparatively easy. Here are some milestones — yes, it took forever, but as I tried to explain, there were reasons for that. So what were the benefits, and was it even worth doing this huge, painful thing? We have pretty good code coverage of the new code. This number is a little confusing, but we can check how it looks right now: when we focus on the pyanaconda package, which holds most of the Python code, you can see that the core package — the general library of functionality — has pretty good coverage, the modules have pretty good coverage, and the UI... we don't have tests for the UI, so obviously not so good there.
And most of the modules themselves have very high code coverage, except for the ones that are really big, like the network module, the storage module, and — I forgot the last one — payload. Oh, payload is actually pretty good. So yeah, I think we did a pretty good job: when we started, I don't think we even measured code coverage, but it was around 20%, and that wasn't great. So what's next? The number of end-to-end tests that we run daily on Fedora and RHEL with our upstream changes is over a thousand, which is a lot, because when we started these tests didn't even work properly — so having them run daily lets me sleep at night. Another side effect was that we kind of accidentally stabilized the current user interfaces, because when you are modifying the user interface to interact with the D-Bus API, you have to touch the code and you have to test it, and if you find a bug it doesn't make sense to ignore it and leave it there — you fix it. So we were finding and fixing a lot of bugs just by working with this code, and then when we had bug reports, for example for RHEL, we often noticed that we had already seen this upstream and fixed it, so it was very easy to just backport the fix to RHEL, and it didn't cost us any additional work. That was great. Another benefit of developing the backend is that it enabled the development of the web UI, because the web UI wouldn't even be possible if it had nothing to talk to. So this was a crucial part of that, and the fact that we can work on it now is great — but as I said, this all started years ago. Another thing we improved is the simplified way of customizing Anaconda for products. As I mentioned, we had issues with that and had to handle the defaults differently, and a side effect is that it's now very easy to provide new defaults for your product: you don't have to understand Python, it's no longer a weird Python class that breaks all the time, it's just a simple text file that everyone can understand and change, and you can find it in our repository. Another thing — this one was intentional — is that the support for add-ons is much better at the backend level. There is basically no difference between add-on D-Bus modules and our own D-Bus modules, because they use the same base API and we treat them the same, so we were able to remove some weirdness around them and make them easier to develop. Add-ons can also now be developed in other languages if you are interested, because it's D-Bus and we don't really care what's running behind it, so that's another nice thing. And one controversial thing I want to mention: since we are dropping the dependency on the kickstart data object as the data holder and using kickstart just as an input format, and it's no longer there just to hold data, there's a possibility to support more formats, or maybe switch the format, or things like that, because we no longer depend on kickstart so much. I personally don't like the current format, so if anyone is interested, this is definitely something to think about, because I think we could do much better. So what's the future? Right now my colleague is working on phase two, which means he's writing D-Bus support for the runtime configuration.
In the future we want to clean up and stabilize the D-Bus API, because it's still a little rough and messy — it was developed over six years, so some of the early parts are not as nice as the later parts — and we also have zero documentation of the D-Bus API, which is not great. Unfortunately all the resources are currently working on the web UI, so I cannot promise when we will have documentation of the D-Bus API, but we will get there. Yeah, so that's all from me. Very quickly, because I have some time left: one of the side effects of all this is that we created some new libraries. As I mentioned, we have the dasbus library for the D-Bus communication, and there's another library called Simpleline, which is a very simple Python framework for text-based user interfaces — because we don't only have the graphical user interface, we also have a text-based one, and because of some weirdness on s390 we couldn't use existing libraries. This is code that used to live in Anaconda for a very long time; we cleaned it up and created a completely new library that's independent of Anaconda, so you can also use it if you are interested. Here is some additional info. As I said, don't look for the D-Bus API in our documentation, it's not there. If you have any questions, here are our contacts — that's the best way to get answers. Does anyone have any questions?

Okay, so the question — sorry — yeah, the question was: since we are considering a new format instead of kickstart, have we considered using the existing formats that other installers use? So yeah, this is not really an initiative, it's an idea I'm throwing out for anyone who's interested, but it would definitely make sense to look at what's already out there first. I personally would like to be able to use Ansible, but we looked into that and it would be quite difficult to support. Still, it makes sense to try to unify the formats and make it happen. Okay, the next question was, roughly: did we consider creating a simple client that could drive the installation first, and then writing the web UI on top of that client tool? Right. So, no. Basically you have very limited resources, so you have to justify everything you are doing, and there wasn't really a need for this kind of tool, because there are other tools that are better at just doing an installation in a non-interactive way, and Anaconda's focus is the interactive installation. It's definitely possible to use the backend and write a very simple tool that drives it without any user interface, but there was no demand for that, so we didn't really explore this area. Yeah, definitely — well, there was no demand, no one asked for it, so we couldn't just go and write a piece of code that no one really needed at that point. Another question? Okay, so the question was about the web UI and how the web UI actually talks to the D-Bus services, which is a very good question. Yes, we are using the cockpit bridge, and basically the whole Cockpit setup, to communicate with the D-Bus services, because the support is already there and it was very easy to reuse for this use case. And are we using all of it? Maybe you were not there earlier: we were doing this migration iteratively over the years,
so there was more and more backend over time, and since we finished phase one, all the modules related to the system installation are basically finished, so all of that is running via D-Bus. The missing parts are just some very simple edge cases related to the runtime configuration, and that's currently being worked on, but otherwise it's all on the bus — and most of it has been on D-Bus for years. Yeah — so the question was when we will be able to leave the hybrid solution and fully switch to the D-Bus modules. We should be able to do that at the end of phase two, which is currently under development, because phase two targets the missing kickstart commands; once all of these commands are handled by D-Bus modules, we can drop their support in the user interface, and the user interface should use only the data in the D-Bus modules. It's also critical for the web UI, because it cannot access any data that isn't available on the bus. Then there was a question about whether a few more months of development to push the code coverage to 90% would have taken the stress out of all these scenarios. So basically I guess the question is why the coverage isn't higher — do you mean the unit tests or the end-to-end tests? Okay, so the code coverage number is about unit tests, and you really don't want to write unit tests for something that you are going to refactor three weeks later. As we moved a piece of code to a module, we tried to cover it with unit tests, but Anaconda is very complicated and it's a lot of code — really, it's about 100,000 lines of Python — so it's not easy to cover all the edge cases. I would say we did our best and this is what we have, but there's definitely space for improvement and it can be better. Okay, so the question was whether we considered dropping support for some difficult parts and what that process looked like. What we were able to do is that we very often found something that hadn't worked for years, so we silently dropped it, because no one was complaining that it didn't work — though sometimes people noticed later, and we had to say: okay, it's gone, sorry, we are not reviving that. As for specifically targeting difficult parts: I guess it never really was on the table; we always tried to support all the use cases we had, which maybe we shouldn't have. We didn't really try to drop the things that were barely working, because we were afraid that someone would complain — "but this used to work and now it's not there, and you did a horrible thing." Maybe we should have. Okay, so thank you so much for coming, thank you.

Hopefully you can hear us. Welcome to DevConf, welcome to our session. My name is Victor, this is our team — we're both from Red Hat, from the core kernel team, and we'd like to tell you something about what happened in the BPF world in the past, let's say, one to two years. Let me start with a bit of statistics. You can see the number of commits and the number of changes in the BPF subsystem in the kernel over the past two years, and you can clearly see that the trend is rising, so BPF is really getting a lot of attention upstream; there's a lot of new work appearing practically every day.
When we were putting this talk together, we wrote down everything we found interesting that happened in the past two years, and the overall talk came to something over an hour, so we had to cut it to less than half to fit the schedule. We won't be covering everything, but we will cover the most interesting parts from our point of view, which are especially interesting for people who try to develop their own BPF applications. Maybe let me first start with a quick poll: how many of you have ever written a BPF application of any sort, in any framework, whatever? Okay, that's cool. This talk will be quite technical, so hopefully you have at least some basic knowledge of BPF. Let me first do a quick recap of what BPF, or eBPF, is — I'm going to use these two terms interchangeably today, eBPF equals BPF. Basically it's an in-kernel, let's say, virtual machine which allows you to run your own programs inside the kernel with kernel privileges. Those programs are written using so-called BPF instructions, a special instruction set for this purpose, and one of the most interesting parts is a component called the BPF verifier, which checks every program that you try to load into the BPF subsystem and verifies that the program is safe to run inside the kernel: it doesn't crash the kernel, it doesn't hang in the kernel — so the program terminates — and so on. It is quite a strict component and we'll be talking about it a lot today, which is why I'm mentioning it. Once you have your eBPF program loaded into the kernel, you can attach it to various events in the kernel, such as kprobes, sockets for network filtering, cgroups and so on; there are so many events you can attach BPF programs to these days. Okay, we're going to split the talk into several parts, and in the first part we'd like to introduce some new features that you, as BPF program developers, have available and can use to write or enhance your BPF programs. First of all, one of the most interesting things that happened in BPF in recent years are the so-called BPF kernel functions, or kfuncs. If you've ever written a BPF program, you probably know that it's not easy to call kernel functions from BPF programs — you can't just call anything, because obviously that would be too dangerous. The first way to approach this were the so-called BPF helpers, where the kernel contains a list of functions that are allowed to be called from BPF programs. They were quite difficult to add, so the list of helpers didn't grow much over time, but it turned out there was a need to keep adding new functions that can be called from BPF programs. That's why the concept of BPF kernel functions was created; they are much easier to create and add. You basically just annotate the function inside the kernel with a special annotation and register it for the correct program type, and you can then call the function from your BPF programs. One of the nice things about kfuncs is that they allow the verifier to perform additional checks, so that it can verify that the usage of these functions inside your BPF program is safe. This is done through, again, another set of annotations — an example makes it clearer.
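A rough sketch of what that looks like on both sides follows; the macro names, the KF_DESTRUCTIVE flag, and the attach point here are assumptions that have changed across kernel versions, so treat it as a hedged illustration rather than a copy of the real patch (the BPF side also assumes the usual vmlinux.h / bpf_helpers.h / bpf_tracing.h includes):

```c
/* --- kernel side (sketch): expose crash_kexec() as a kfunc --- */
#include <linux/btf.h>
#include <linux/btf_ids.h>

BTF_SET8_START(my_kfunc_ids)
BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)  /* flag tells the verifier it is destructive */
BTF_SET8_END(my_kfunc_ids)

static const struct btf_kfunc_id_set my_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &my_kfunc_ids,
};

/* called from some init path: allow tracing programs to use this set */
/* register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &my_kfunc_set); */

/* --- BPF program side (sketch): declare the kfunc and call it --- */
extern void crash_kexec(struct pt_regs *regs) __ksym;

SEC("fentry/some_kernel_function")   /* hypothetical attach point */
int BPF_PROG(trigger_crash_dump)
{
	crash_kexec(NULL);           /* destructive: panics and writes a crash dump */
	return 0;
}
```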
So we actually have one example, written by Artem here, where he added a new kfunc for calling the crash_kexec function. It was as simple as this: he basically added the function to a list of functions, and that list is registered as allowed to be called from a certain type of BPF program. What this lets you do is that, when you're writing your BPF program, you just declare this function as extern, and then you can easily call it from your BPF program — which in this case means you can crash the kernel from your BPF program. Sounds interesting, but is it in any way useful? Well, quite, because you can crash the kernel in a controlled way: you can let your BPF program simulate some situation that can happen in production, then crash the kernel, which gives you a crash dump, and you can analyze that crash dump afterwards and maybe find a problem that you wouldn't be able to reproduce otherwise. So that's one of the use cases for kfuncs. Another concept that appeared in BPF recently are so-called referenced pointers. One of the problems is that working with pointers in BPF programs is complicated, because the verifier has to check every time that the pointer you dereference only accesses memory that is available to you, and there have been a lot of problems with accessing memory from BPF programs. One way to approach this are referenced pointers, which are actually implemented using kfuncs — that's why I started with those. They basically allow kfuncs — functions in the kernel that your BPF program calls — to return a pointer which you can then dereference and use to access memory. There are two kinds of these functions: one kind is annotated with the acquire annotation, which says that the kfunc returns a referenced pointer, and the other is tagged with the release annotation, which says that the kfunc releases a referenced pointer. These work roughly the same as references in other languages: there is a component counting the references to the pointers, and you can be sure that the pointer, or rather the pointed-to memory, is not freed while you hold the reference. Without this, every memory access from a BPF program had to go through a special BPF call that registers an exception handler, because the memory could have been freed in the meantime. This mechanism allows much easier and more straightforward access to memory, by actually holding a reference to the pointed-to memory, which prevents it from being freed while you hold the reference. So, as I said, the verifier checks that the reference is always valid. But one problem this created is: what if you want to acquire a pointer and then use it from a different BPF program — is that even possible? Luckily it is, through yet another concept, so-called long-lived kernel pointers, which are a new kind of BPF pointer. They have several properties: they must be strongly typed, they are returned by kfuncs or by helpers, and they may be stored inside BPF maps — which is the most important part, because you can acquire a pointer, store it in a map, and another BPF program can pick the pointer up from the map, dereference it and use that memory, and you can still be sure that the memory is not freed in the meantime.
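As a hedged illustration of that pattern — acquiring a referenced pointer and parking it in a map for another program to use — here is a sketch loosely modeled on kernel selftests; the kfunc names (bpf_task_acquire/bpf_task_release), the __kptr_ref annotation spelling, and the tracepoint used are assumptions and have shifted between kernel versions:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#ifndef __kptr_ref
#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))
#endif

/* map value holding a referenced kernel pointer */
struct map_value {
	struct task_struct __kptr_ref *task;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct map_value);
} stash SEC(".maps");

/* acquire/release kfuncs exported by the kernel (assumed names) */
extern struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;

SEC("tp_btf/task_newtask")
int BPF_PROG(stash_new_task, struct task_struct *task, u64 clone_flags)
{
	struct map_value *v;
	struct task_struct *acquired, *old;
	int key = 0;

	v = bpf_map_lookup_elem(&stash, &key);
	if (!v)
		return 0;

	acquired = bpf_task_acquire(task);        /* take a reference */
	if (!acquired)
		return 0;

	old = bpf_kptr_xchg(&v->task, acquired);  /* park it in the map */
	if (old)
		bpf_task_release(old);            /* drop any previous reference */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```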
There are actually two kinds of long-lived pointers. The referenced ones are the more interesting, but there are also unreferenced ones, which can still only be accessed with probe reads — basically plain pointers without reference counting that you can store in maps, but you still have to use the special probe-read call to access the memory. The referenced pointers stored in maps, however, can be safely dereferenced without a probe read, and they are automatically destroyed once the map is freed, which is the nice part: there's a sort of automatic garbage cleanup. You acquire a pointer, you pass it to a different BPF program through a map, that program works with it, and if it doesn't free it, it doesn't matter — once the map holding the pointer is freed, the pointer is automatically destroyed. Okay, as I said, there will be a lot of technicalities in this talk and it's quite packed, so sorry if I'm going too fast. Anyway, another thing that has always been troublesome in BPF is iteration. As I said at the beginning, one of the things the BPF verifier checks is that your BPF programs do not hang — they must always terminate. This is a problem, and it's been approached in different ways throughout history. First, we basically said there can be no loops in BPF programs, which works and is efficient, but it's far too constraining, as you can imagine. The second approach was to let the compiler unroll loops, which works again but is quite impractical. Then the BPF subsystem allowed bounded loops — loops with a known number of iterations — but they were still quite difficult to verify, because the verifier had to walk every program path through the loop and check that the number of iterations really is fixed. So one of the more recent things that appeared in BPF is the bpf_loop helper, which resolves this problem: it's a new helper where you pass it a number of iterations and a callback function, and it executes the callback that many times for you. This is very easy to verify, because the looping is not inside your BPF program — it's handled by the BPF subsystem itself — so if the verifier can prove that the callback function terminates, your loop automatically terminates as well. It's a very elegant and much simpler-to-verify way of writing loops in today's BPF programs. Another problematic case is that sometimes you want to iterate over things that don't have a known number of items, but you know the number is finite. For instance, you want to execute some BPF program for every process running on your system: you don't know how many processes there are, but you know there's a finite number of them, so it will terminate. To address this, there is the concept of BPF iterators, which have been around for a while — special programs that are not attached to events but are executed for each kernel object of a certain type, for instance for each task, each file, each VMA, and so on. What is new about BPF iterators are so-called generic iterators, which allow you to add new iterable objects very easily, through a concept very similar to iterators in, for instance, C++: you specify four things — you define a structure that holds the iterator state, and then you define three functions, one for creating the iterator, one for getting the next item, and one for destroying the iterator once the iteration is done. Just by registering these, you can create a new iterable object inside the kernel. Quite an elegant thing.
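Going back to the bpf_loop helper for a moment, here is a minimal hedged sketch of the pattern; the attach point is an arbitrary example, and the callback signature follows the helper documentation as I recall it:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct sum_ctx {
	__u64 sum;
};

/* The callback is the only thing the verifier has to prove terminates;
 * the looping itself is done by the BPF subsystem.
 * Return 0 to keep looping, 1 to stop early. */
static long add_index(__u64 index, void *ctx)
{
	struct sum_ctx *c = ctx;

	c->sum += index;
	return 0;
}

SEC("tracepoint/syscalls/sys_enter_getpid")   /* arbitrary example hook */
int sum_some_numbers(void *ctx)
{
	struct sum_ctx c = {};

	bpf_loop(100, add_index, &c, 0);          /* flags must currently be 0 */
	bpf_printk("sum = %llu", c.sum);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```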
The last thing in my part is the multi-attachment feature that appeared in the past year or two. One of the problems of BPF programs is that they sometimes take time to attach, especially if you're trying to attach one BPF program to many events, such as all the syscalls — there are three hundred and something syscalls on a standard machine, and if you want a BPF program to run every time a syscall is hit, attaching can take some time. So there is a new link type, BPF_TRACE_KPROBE_MULTI, and it's implemented using fprobes, which are a bit similar to kprobes, if you know those, but built on top of ftrace. The difference from kprobes is that it's available only for function entries and exits, so you cannot attach to arbitrary instructions, but it allows you to attach to multiple functions very quickly. I have an example here where I'm using the bpftrace tool to attach to all the syscalls — you can see it's a few hundred probes — and while it took some 15 seconds to attach with the old probes, with these new ones it takes less than a second. So a major speedup for tools or programs that want to attach to many, many probes. Okay, that's everything from my part; we'll have to pass the mic. If you have one quick question, we can take it — okay.

So now let's talk a bit about BPF inner workings. First I'll talk about memory management, and there were a couple of interesting developments here. The first is the BPF-specific memory allocator, and the allocated objects and linked lists it enables, and another one is the BPF prog pack allocator. The BPF-specific memory allocator was introduced by Alexei Starovoitov and it's used for dynamic allocation of memory in BPF programs. Obviously there are already a number of memory allocators in the kernel, but none suited BPF programs well, because memory allocation in the kernel depends heavily on the context it runs in, and BPF programs — especially the tracing ones — can run from any context, including NMIs. This is a common problem with memory allocation, and there are common ways to deal with it; one is known as memory pools, and the idea is that you pre-cache some memory in a non-restrictive context and then use it when the time comes. The BPF memory allocator does exactly that: it creates, optionally per-CPU, caches of objects of a predefined size and manages them through irq_work, which is a relaxed context. One of the issues with memory pools is over-extending memory — using more memory than is needed at the time — so Alexei tried to remedy this with high and low watermarks that keep the number of cached objects low. The interface is pretty simple: you basically get two pairs of calls, one to work with variable-size objects and another with a predefined size. As you might have seen, this needs a reference to a struct bpf_mem_alloc, which you need to initialize in advance and destroy when you're done; you can define whether you want it to be a per-CPU cache or not, and the size of the objects. Okay, and there are some real-life applications already: the patch set that introduced this allocator also switched the dynamic hash map implementation to use it, and it claims 10 times faster dynamic allocations.
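Here is a hedged sketch of what that in-kernel interface looks like, as far as I can reconstruct it from include/linux/bpf_mem_alloc.h; this is kernel-internal code, not something a BPF program calls directly, and the exact signatures are assumptions:

```c
#include <linux/bpf_mem_alloc.h>

struct my_obj {
	int value;
};

static int example_usage(void)
{
	struct bpf_mem_alloc ma;
	struct my_obj *o;
	void *p;
	int err;

	/* size != 0: a cache of fixed-size objects; the bool selects the
	 * per-CPU variant. With size == 0, caches for a range of sizes
	 * are created and the size-based pair below is used instead. */
	err = bpf_mem_alloc_init(&ma, sizeof(struct my_obj), false);
	if (err)
		return err;

	/* fixed-size pair */
	o = bpf_mem_cache_alloc(&ma);
	if (o)
		bpf_mem_cache_free(&ma, o);
	bpf_mem_alloc_destroy(&ma);

	/* variable-size pair, usable when initialized with size == 0 */
	err = bpf_mem_alloc_init(&ma, 0, false);
	if (err)
		return err;
	p = bpf_mem_alloc(&ma, 64);
	if (p)
		bpf_mem_free(&ma, p);
	bpf_mem_alloc_destroy(&ma);

	return 0;
}
```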
It also allows sleepable and tracing BPF programs to use dynamic hash maps, which wasn't possible before. A bit later, another patch set switched BPF local storage to use this new allocator, and that fixed a deadlock problem that occurred when BPF local storage was used from tracing programs, caused by lock contention. And one of the most important things it enabled is allocated objects and linked lists. Just a couple of months after the BPF allocator was introduced, another patch set was posted that introduced allocated objects and linked lists. The former allows BPF programs to allocate their own objects of a type defined in the program BTF, and basically enables them to build complex data structures flexibly. The latter introduces linked lists, which are single-ownership; they can be put into maps or into these allocated objects, and can hold such objects as elements. As usual with BPF, everything is verified and checked, so they're supposed to be safe, at least from the verifier's standpoint. Unfortunately the interfaces for these are currently experimental, so I'm not going to show any, but if you want a head start you can look for examples in the selftests/bpf directory, and in the same directory you can find the bpf_experimental.h header file, which is kind of a staging ground for these in-development APIs. Next is the BPF prog pack allocator. Unfortunately, due to time constraints, I had to cut this almost completely, but the good news is that it's very well covered by LWN, and if it sounds interesting I encourage you to go through those articles — they're very well written. The idea of the BPF prog pack allocator is first to save some memory, which is achieved by packing multiple BPF programs into a single page; before this, every single BPF program used a whole memory page, which on x86 is 4 kilobytes, and that is way more than a BPF program usually needs. It also tried to improve performance by using huge pages to reduce iTLB pressure, but due to bugs found in this patch set, and also in previous patches that this set uncovered, that is not yet the case; as of now the prog pack allocator is merged and working, but it only packs BPF programs into single normal pages. This also inspired some work on generic executable-memory allocators. The first attempt wasn't that successful, because it wasn't available on all architectures, and another important user of executable memory in the kernel is kernel modules, so it couldn't be used for those — that's why it was decided not to merge it. But just a couple of weeks before this talk, another attempt was posted, called the JIT/text allocator, and it is currently under discussion; there are obviously some concerns, but it looks much better. Another topic is BPF program signing. This is a long list of small improvements, and it started a long time ago — the first patch was posted in April 2022 — and we're still not there. BPF programs might look similar to kernel modules at first glance: they're both stored on disk as ELF files, they both require relocations, they both require memory allocation and so on. But there is one important difference: in the case of kernel modules, the kernel does all the work — it understands the structure completely, it does everything — while in the case of BPF, libbpf does a lot. So by the time the code gets into memory, it might be very different from what was on disk, which invalidates the signature. To achieve BPF program signing, the kernel needs to do more,
and there were a couple of approaches to this. The first was trying to move the whole of libbpf into the kernel, and that didn't work because it's big and unwieldy. Then there was an idea to implement a new file format that would be understood by the kernel, which was dropped in favor of a new BPF program type, which was probably easier to get into mainline. To understand what the kernel needs to do, we need to understand what libbpf does. Its processing can be split into four main phases, but only the first two matter for program signing. The open phase is where the object file is parsed and where libbpf learns about programs, maps, external functions and so on, and the second phase is where the code changes actually occur: here libbpf probes kernel features, applies relocations, creates maps and so on — everything that would need to be done by the kernel instead. One of the places where code changes occur is map creation: before this change, BPF programs accessed maps through file descriptors, but those can only be determined at load time, when the maps are created. So instead of referencing them directly, another abstraction was added, file descriptor arrays: BPF programs now reference indexes into these arrays, and the arrays are populated during program load. Another introduction is loader programs. You might know that everything BPF related in the kernel is done through one single syscall, sys_bpf; the first argument is a command ID, and there are about 32 of them, so everything — map creation, program loading, attaching, BTF lookups — is done through this syscall, and the net result of libbpf is a list of these syscalls. The idea was to write those down and replay them when the time comes, and to do that a new BPF program type was introduced which can only call the sys_bpf syscall (and also sys_close) and can only be run from user context. But it's still a BPF program and we still need to load it — that's where light skeletons come in. The normal workflow with libbpf is that you write some BPF programs, put them in a C file, compile it, and get an object file; that object file then gets parsed by bpftool and you get a skeleton header file. That skeleton header contains the whole original ELF object file, it contains structures describing maps, programs and so on, and it contains multiple functions to work with those — to load them, attach them and so on. Then you take that header file, include it in your user-space program and work with it. The light skeleton is different in that it does not contain the original ELF file; instead it includes the loader program. That also means we don't need libbpf or libelf headers anymore, so your user-space program no longer depends on those. For your user-space program the change is almost invisible, because the two are interchangeable; the only thing is that you cannot use libbpf functions anymore, and in most cases that means you need to change the way you access map and program file descriptors — previously you got them through libbpf functions, now you get them through the skeleton structure.
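To make that concrete, here is a hedged sketch of what the user-space side looks like with a light skeleton; the bpftool flag, the generated header name and the exact structure layout are assumptions on my part — the point is that the generated functions mirror the normal skeleton API:

```c
/* header assumed to be generated with something like:
 *   bpftool gen skeleton -L myprog.bpf.o > myprog.lskel.h */
#include <stdio.h>
#include "myprog.lskel.h"   /* hypothetical generated light skeleton */

int main(void)
{
	struct myprog *skel;

	/* open_and_load() runs the embedded loader program via sys_bpf */
	skel = myprog__open_and_load();
	if (!skel) {
		fprintf(stderr, "failed to load BPF object\n");
		return 1;
	}
	if (myprog__attach(skel)) {
		fprintf(stderr, "failed to attach\n");
		myprog__destroy(skel);
		return 1;
	}

	/* no libbpf accessors here: map and program fds are read straight
	 * from the skeleton structure, e.g. skel->maps.my_map.map_fd
	 * (assumed field layout) */

	myprog__destroy(skel);
	return 0;
}
```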
All of this was just one single patch set, but it still missed a couple of important things. One is CO-RE support, which is very important because it allows for greater BPF portability; in this case the source file implementing it was changed so that it could be compiled for both kernel and user space, and this allowed more BPF programs to go through the light skeleton and allowed removing the libbpf dependency from the BPF preload user-mode driver. BPF preload is a set of BPF programs that are bundled with the kernel. This also enables languages such as Go to use the full advantages of CO-RE; before, they couldn't, because they can't adopt libbpf for whatever reason. And the last thing is the light skeleton in the kernel: instead of including a light skeleton in a user-space program, you can now include it in the kernel, and this allowed dropping the user-mode driver from BPF preload completely. That means we now have BPF code in a kernel module, which can be signed — so we get signed BPF code, but only kind of, because it's a module, it lacks portability, and so we are still some way from true BPF signing. But I hope that next time we talk about this, it will be there. That's it from my part; questions?

Yeah — well, the kernel documentation, I think. The question was what the best source of information about BPF is, and what the important parts we missed were. During the last year a lot of commits were devoted to documentation, so the Documentation directory in the kernel is very good. There are also books on this by the authors, and the latest and greatest is always the kernel source tree and the selftests directory. What's the difference between kprobes and kfuncs? Yeah — so you're speaking of the bpftrace tool specifically, right? The difference is which kernel mechanism they use to attach to the functions: kprobes use the kprobe mechanism, which has been in the kernel for a while, while the kfuncs — which are unfortunately named the same as the kfuncs in the kernel, but they are a different thing — map in bpftrace to the fentry/fexit probe types, or program types, in BPF. These are special BPF probes that allow you to attach to the beginning or the end of a function; they are much quicker than kprobes and have some more advantages, such as access to function arguments, because they leverage the BPF type information, BTF. Yeah — so the question was that the bpf_loop helper introduces an indirect call, which is generally viewed as not a very good thing in the kernel because it can introduce performance problems, and whether BPF is doing something to mitigate those, if I got it correctly — specifically from a performance perspective. To be honest, I don't know; I don't have that much insight into this. I would expect that they employ some mitigations, but I don't know of any particular one, sorry. Yeah — so the question is whether there are any limits on stack depth in BPF programs, am I correct? There are. One thing is that the BPF stack is quite limited — it's 512 bytes in total — and I don't know the exact limit, but there is a limit on the stack you can have; it's checked by the verifier. The limits in general are quite strict in the verifier, often stricter than developers would want them to be. So the question is: since BPF acts sort of like a virtual machine inside the kernel, what is the overhead of executing BPF programs, and specifically with tracing, what is the difference between using BPF programs for tracing and the kernel's own tracing? So, it's called a VM, but the programs are just-in-time compiled, and the translation from BPF instructions into native instructions is usually one-to-one, I would say. So the overhead is basically a lookup in a table which tells you which instruction to convert it to
for the specific architecture, so the overhead is very, very close to zero. Okay, and we're out of time, so thank you for your attention, thank you for coming; if you have any other questions, feel free to grab us in the corridors and discuss more. Thank you.

Okay, thank you very much, and sorry for the delay. My name is Adrian Moreno, I work for Red Hat in the networking team, and I'm here today to present a new tool that I have written with my two colleagues, Antoine Tenard and Paolo Valerio, and we're going to talk about networking, debugging and tracing. This is a short agenda; I will try to go fast through these slides and run a live demo, although given the situation I don't know if I should — anyway, let's jump directly to the problem. The problem we're trying to solve is network visibility and network tracing, and at least for me this is a three-dimensional problem. On the one hand we have many components in the Linux kernel: a packet can be in the TCP stack, the UDP stack, TC, netfilter, or in OVS — and OVS is especially complicated because the packet temporarily goes to user space and then gets re-injected into the kernel — so there are many places where a packet can be. But in each of those places we don't only have the packets we want to look at; we have all sorts of packets, so our traffic is hidden among lots of other packets and we need good filtering. And apart from that, packets mutate over time, so filters get stale: you can filter on the source IP address and then the source IP address changes because you do something like NAT. So it's really complicated — you have these three dimensions when trying to find your packets and know where they are and what happened to them in the kernel. Looking at the existing tools: we have of course the venerable tcpdump — we all love pcap filtering, you can express any filter and it just works, and it's of course the seed of eBPF, so all our respect. We have tools like dropwatch, which is focused just on drops and gives you the stack trace of each drop — it's really nice. We have pwru — if there's someone who hasn't tried this tool, I recommend it, it's awesome; it probes many different places. And we also have tools like bpftrace, perf and SystemTap, which are slightly complicated, to say the least, but very, very powerful. So with all this, what is Retis, the tool that we've written? We can give a definition: it's a tracing tool that gives contextual information from different places in the stack. But you can also think of it as tcpdump plus pwru plus perf, times Rust. Of course we haven't put all the features of these tools into ours, but we have taken a lot of inspiration, a lot of the nice things that we like from these tools, and kind of integrated them, so if you do know these tools, some things might feel familiar. So let's jump into how to use Retis and try to understand this tool, and the easiest way is an example. This is an example of Retis and the output it produces; if we understand this, we understand Retis and what it does. First of all, Retis is based on something we call collectors. Collectors tell Retis what data to extract: for instance, we have the SKB collector there, and the SKB collector collects information from the sk_buff — from the packet — so the line that resembles tcpdump output is what the SKB collector generated.
Apart from that, we also have the NFT collector there; the NFT collector generated that little line down there, which is the nftables chain, the verdict, and the table. So collectors collect data and extract information from the kernel. On the other hand, we have something called probes; probes tell Retis where to look for packets. Some of the probes are explicit — in this case we have an explicit probe there, so we told Retis: please hook the ip_rcv kprobe, which is a well-known function in the IP stack — so basically we attached an eBPF program there and the collectors collected the available information. Some probes are explicit, but some are automatic, like the one for the NFT collector: you cannot collect netfilter/nftables information from just anywhere in the kernel, so the collector automatically added a probe in the special place in the kernel where this information is available. So we have probes and collectors, and if we start combining them, we think we can achieve a fairly good low-level tool for network tracing. We have many existing collectors — I'm going to go through these slides quickly because we don't have much time due to the delay at the beginning. We have the SKB collector, which collects packet information. We have the NFT collector, which we showed in the first example; it can do something really cool, which is filter on the verdict of the netfilter rule. We have SKB drop, which extracts drop reasons from a special function in the kernel. We have SKB tracking, because we do extensive tracking of packets — that is a very important feature inside Retis. We have an OVS collector — just a small reminder of how OVS works for those of you who might not be familiar: in OVS we have a kernel datapath, but it acts like a cache; it's empty at the beginning, and the first time we see a packet we send it to a user-space daemon, where we process it and determine what to do with it, and then we put the packet back into the kernel together with a flow that tells the kernel datapath what to do with similar packets, so the next packet that looks similar is processed entirely in the kernel datapath. The OVS collector does exactly that tracing: it adds some automatic probes in the kernel datapath and the user-space daemon in order to extract all this information. There's a short summary table of these collectors. And we found a problem: as we start adding probes, and since many of them are explicit, we end up with very long command lines, and we need kernel knowledge to actually know what to probe — this might be obvious for a kernel networking engineer, but it might not be obvious to everyone else. So we developed something called profiles: just a YAML file where we list the probes and enable the collectors. It's very simple, easy to share, easy to ship in your distro packages or wherever, easy to write for specific use cases and have ready for your debugging sessions — you just enable the profile and that's it. And one of my favorite features: we have pcap filtering, with the same syntax as tcpdump — exactly the same. If it works in tcpdump it will most likely work in Retis; it covers, I think, the vast majority of common use cases. Retis takes the filter expression, translates it to classic BPF, translates the classic BPF to eBPF, and inserts it into the kernel for filtering.
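Retis itself is written in Rust, so this is not its actual code, but just to illustrate the first step of that pipeline — compiling a tcpdump-style expression into classic BPF instructions — here is a small sketch using libpcap; the filter string is only an example:

```c
#include <pcap/pcap.h>
#include <stdio.h>

int main(void)
{
	/* a "dead" handle: we only want the filter compiler, not a capture */
	pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);
	struct bpf_program prog;

	if (pcap_compile(p, &prog, "host 10.0.0.1 and udp", 1,
			 PCAP_NETMASK_UNKNOWN) < 0) {
		fprintf(stderr, "compile failed: %s\n", pcap_geterr(p));
		return 1;
	}

	/* prog.bf_insns now holds prog.bf_len classic BPF instructions;
	 * dump them one per line, the same output as tcpdump -d */
	bpf_dump(&prog, 1);

	pcap_freecode(&prog);
	pcap_close(p);
	return 0;
}
```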
Similar to perf, events can be stored in files for easy post-processing, and the events in this case are just JSON, so you can do any kind of post-processing — you can write your own post-processing in Python or whatever and get more insights from your events. One of the built-in post-processors is the sort command, which I will show you in the live demo, actually — so fingers crossed, bear with me. Okay, so I had to reboot the machine just seconds before starting, so I hope my script works. I have a very simple setup here: two network devices attached to two network namespaces, a private one and a public one, and we're just doing NAT (masquerading) between them. So for instance, I'm going to enter the private network namespace and ping, and I have connectivity with the public one — let me verify this: I'm pinging this IP address, right. Okay, so I'm going to capture the packet as it's received, and as you can see, the source IP address is not the source IP address that I have here, so there's NAT going on. So let's see how Retis can help us follow that ping. I have some profiles installed — a UDP profile, a generic profile, an NFT profile; the generic profile is shipped with Retis, at least with v1, and it's pretty useful for starting a debugging session. I'm going to use the generic profile and the NFT one, and I'm going to filter on host 192.168.2... okay, so that's interesting — a live demo, and of course it's not working. Why not... okay, sorry. Sorry for that; I have a plan B, which is — oh gosh, but this is not visible, is it? Do we see something there? Oh my god, sorry, sorry guys. Can we zoom in? I just rebooted this machine before. Do you see something? Well, these are events, okay? So I just ran the NFT collector and it printed a bunch of events, and what I wanted to show you is that this is just putting some events into a file called events.json. So events are stored in a file, and after that I can run the sort command: when I type retis sort events.json, this is the output. As you can see here — at least visually, I hope you can see it — most of the events are indented. This is because we detected the first packet and we identified that the rest of the events belong to the same packet, and we identified that even through NAT: at some point the source IP address of the events changes from here, which is 192.168.102, and it becomes a different one. So with this demo I wanted to demonstrate that we can get events all around the network stack: we see IP receive, we see IP forwarding, we see the NAT manipulation functions — we see the NAT happening — and we see the packet being received. Later on in this demo I increase the rate of the ping, and I see that some packets are being dropped. Doing the same experiment — collecting the events and sorting them — I can look at the drop, and I can see (you don't see it here, but at some point you see it) a drop in nftables, an NFT event dropping the packet, which has a different source IP address than the one I put in the filter. So even though I was filtering on the source IP address and the source IP address changed, I was able to detect it and see that this particular packet
got dropped because I had a netfilter rule in egress which dropped it, and at the end I see this SKB drop event with the SKB drop reason, which in this case is netfilter drop. Also, there's another way to see it: we can enable just the SKB drop and the NFT — sorry, collectors — and pass the --stack option. The --stack option will print the stack trace of each event, so we will capture the packets as they get dropped and print the stack trace. This is similar to dropwatch, if you know the tool. Okay, and here again it's pretty small, I think, sorry about that. In this other demo I have an Open vSwitch setup, and in this setup two namespaces are communicating with each other through OVS, and in this particular example I use UDP — a DNS resolution — so I run a DNS server on one side and a DNS client, basically a dig, on the other side. This is what I was telling you about: this is the cache, the content of the flow cache. The first time we see a packet flowing, two entries appear, two flows appear in the kernel cache, so any other UDP packet will hit that flow and directly be output to the right port, and after a while the cache entry gets invalidated, expires, and gets flushed. So in this particular example, of course, we have a profile that helps in this use case: the profile defines some traces in the UDP stack and some traces in the OVS stack. Profiles can be combined with each other, so in this case we have two completely independent, self-contained profiles that, when combined, will help us debug this particular case, in which maybe we have some latency in UDP resolutions. So we start collecting, we execute the example, we stop collecting, and we run again the sort command, and here the sort command gives us a very big path of events that belong to the same packet. And of course it's not very visible here, but at some point we see packets in the eth port of the source network namespace, we see the SKB going from one network namespace to another — because we also list the network namespace in which we see the packet — we see it being received by OVS, we see it being processed by OVS, we see it being upcalled by OVS to user space, we see it being received by OVS in user space — so we know that it wasn't dropped in the middle, like we didn't overflow the netlink socket that we use. We then see a flow being put, like a flow being configured because of this packet, and we see a flow being executed, meaning the packet being re-injected into the kernel, and then we see events where the kernel receives the packet and executes an action, which in this case is an output action to port whatever. And then the next event here is the veth transmit: again it goes back to the veth and up the IP stack and the UDP stack of the DNS server. So we see the entire flow, even in user space. And if we scroll down to the next packet, what we see is that the next packet didn't go through user space, and we see the OVS datapath execute action right after it being received by OVS, meaning we are able to see which packets go to user space and which packets stay in the kernel, and we can see drops or any unexpected behavior with packets, even in user space. Okay, yeah, I'm really sorry I couldn't do this live, but I can show it here a little bigger. This is the output of the sort command.
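For reference, the command-line shape of this workflow is roughly the following. Only "retis sort events.json" is quoted from the demo; the other flag spellings and the filter address are my reconstruction, not the authoritative retis CLI:

```
# Rough reconstruction of the demo workflow; flags and the address are assumptions.
retis -p generic -p nft collect -f 'host 192.168.1.2' -o events.json   # collect with two profiles and a filter
retis sort events.json                                                 # group events that belong to the same packet
retis -c skb-drop,nft collect --stack                                  # drop-focused run, printing a stack trace per event
```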
You see, the first line is the first time we see the packet — we see it here in ip_rcv — and then we see conntrack, conntrack ICMP, NATing going on, right. So we see all the events sorted; it's like running tcpdump everywhere in the kernel and being able to see it in a nicely sorted manner. And what's next? We have just released the first version, we have many collectors planned — we want to add conntrack, TC, container integration, we want an embedded Python integration, a terminal user interface, and whatever people suggest. So contributions are welcome — that's the GitHub repo, and you can just create an issue and suggest profiles and any other feature that you would like us to work on. And that's it — sorry the demo didn't work, last-minute issues as always — and that's it, if there are any questions. Yes — so there are different techniques. Inside the kernel we trace the SKB: the first time, we use the SKB head, and we track whenever the data pointer changes, so we know when it mutates, but essentially we use pointers inside the SKB to track when the event belongs to that same SKB. In OVS we don't have a very good infrastructure for tracing packets from the kernel to user space, so what we do is we compute a hash from different parts of the packet — because we don't have an SKB struct in OVS, we just have the bytes, right — so we hash it, and we use other techniques to track the packet through OVS, because there are several places where OVS installs flows and does things like that, and then we also hash the packet when OVS re-injects it into the kernel. And then we combine both pieces of tracking information into a single one, so we are able to sort all those events. I don't think we can trace any user space application that can change the packet in any way — like, we cannot do it in a generic way, that's what I mean. We can do it in OVS because we know how OVS works — I happen to work in the OVS team — so we use OVS internal knowledge to know what OVS does to the packet. In addition to that, we expose user statically defined trace points, USDT. So OVS in user space has hooks for eBPF programs to be run, and this allows us to extract that information in very specific key places. Of course, not all applications might do that, and each data path or control path that we want to monitor would need a specific collector that knows how to do that. So yeah, I'm hoping that we can add other user space applications, like other control planes, other programs that alter and modify the behavior of packets — but yeah, not in a generic way. No, we don't have a user interface at the moment, we just have the CLI. We would like to add a terminal user interface similar to what perf has, where you can have all the events, inspect them, expand and collapse these little tables, filter, and things like that. So we do want a terminal user interface, but it's just in the backlog. Yes — oh sorry, okay, so the question was how do we convert the pcap filter into eBPF. Yes, we do use libpcap to convert it to legacy BPF, and then we mangle it — we basically manually convert instructions from BPF to eBPF; they are mostly a one-to-one relationship, but not exactly — and so we create an eBPF program that has that functionality, that filter, and we attach it to the rest of our eBPF programs.
That's pretty cool, actually. Okay, thank you. Right, so the talk is already introduced, I can skip that part and save 20 seconds — well, I'm safe for now anyway — and I'm a little frizzy. Right, so we are talking about Seitan — but that's not the thing you eat. We are actually talking about a lot of privileged operations in the sense of system calls, mostly seccomp, containers, virtual machines, some problems with them — or what we think is a problem in terms of security — and the solution we propose. After that there will be a demo, and then questions — don't overestimate us, but we'll try to answer to your satisfaction. So, a quick recall of what we understand as a system call. In essentially every modern operating system you have several rings, right: you have a kernel ring, you have a user space ring — that might be ring one or ring four — and in any case you have something like a system call abstraction. In the case of Linux it's simply that you have a process requesting resources or services from the kernel, and this is perhaps the main security model that an operating system implements. So depending on your user, or capabilities on Linux, or context on Linux and BSDs, this request might be granted or denied. So if you ever try to insert a kernel module that's called, say, evil_things, as a user, the kernel will not let you do it; however, if you touch your own files, then the system call will succeed, and the difference is just that, well, the number of the syscall is different, and root could do this and root could do that. Okay, that's kind of obvious, but useful maybe to introduce what we want to improve. Quite often we see, in container environments and virtualization engines or mixes thereof — such as, let's say, KubeVirt or Kata Containers for example, but even if you just stick to Podman and Docker — that, say, the container wants to create a network interface. That's usually a tun, like the most basic tunnel interface on Linux, and you need to do it the old way, so without netlink: you have an ioctl and you want to tell the operating system, I want to create a network interface. On Linux this doesn't need root anymore, but it needs CAP_NET_ADMIN, which means you can do pretty much whatever you want: you can create as many interfaces as you want, spoof traffic, bring down the networking. There are a few other examples of these things that are actually quite common, like setting the priority for a real-time virtual machine, where you just need to affect the priority of one process, not all of them. There are other problems too: for example you want to create a device node as a user, you maybe want to connect to a specific daemon or open a specific file, and for all of them there have been impressive improvements recently in Linux, but wouldn't it be nice if we could just say, okay, I want this process to be able to create this tap device? And sure, Linux security modules do something almost like that, but those are kind of fixed policies; they're not so easy to configure dynamically per process. So we started looking into it, and of course BPF and seccomp are an important part of the story: you can do seccomp with a small BPF program where you say that you might want to deny or accept a syscall based on its number. Not good enough, because that's still generic. A big improvement, again recently in Linux: you can tell another user space process details about your syscall — this is called seccomp user notify — and essentially, yeah, the kernel tells
you a lot of things — the arguments. So containers already make use of seccomp: usually you have a JSON file that defines the syscalls that are allowed, denied, or notifiable by the container. Usually the runtime takes the seccomp profile that is part of the OCI spec and basically uses the seccomp library to generate the BPF filter — as we saw from what Stefan just described, a filter is needed for filtering the syscalls. In OCI we also have support for seccomp notifiers, and basically the runtime needs to communicate with the monitoring process through a unix socket — this is the OCI extension — so through this unix socket it passes the file descriptor where the monitor will receive the notification events, and when this starting phase is done, the monitoring process is able to monitor the container and take action if one of the filtered syscalls is executed by the containerized workload. There are already existing solutions that take advantage of seccomp notifiers, like for example LXD, or there is the Kinvolk seccomp agent; however, those projects have in common that they implement a handler per syscall, so if you want to add a new syscall, or maybe even change the behavior, you need to code it yourself — it's not very easy to reuse. And in order to do that, of course, you need a deep understanding of how the seccomp notifier works, and this is where Seitan comes into play. The idea is that if you are an admin, or you are developing a tool, you will be able to describe this in a recipe: basically you will have a match — this describes the syscall that you want to filter, on its arguments — and you will associate an action with it. We chose to use a JSON format, and the Seitan cooker basically takes this as an input file and generates the BPF program and a bytecode representation of matches and actions. We need a kind of launcher that installs the BPF filter and then launches the real process that we want to monitor, and this is the goal of seitan-eater. The actual monitor is what we call Seitan: it takes gluten — that's the bytecode representation of matches and actions — monitors the notifier, and then performs the actions on behalf of your target. So here you have a visual representation of the flow. We have two distinct phases: first the generation of the input, which can happen on a completely different build system — it doesn't need to be, like we saw, in the container runtime when we start the container. The cooker will read the recipe that you wrote and will generate gluten, which will be the input for Seitan, and the BPF filter, which will be installed by the eater. When the eater launches the target, then Seitan can start monitoring the target process. So why Seitan? We decided to choose a declarative approach versus an imperative one — this gives you better visibility of your operations, the privileged operations. It's flexible: you don't need to code an extra handler if you want another behavior, the Seitan setup will take care of that, so what you need to do is just write the JSON recipe. It's generic: it's an independent and self-contained tool, we are not relying on libseccomp — we are going to see that in detail — but basically the setup generates the BPF program and the matches and the actions that we saw. So here you have a visual representation of a code snippet that you could instead solve with the JSON recipe,
which is actually a nice representation of what we mean by declarative versus imperative. Of course, security is one of the strongest and largest use cases that we have, and so, for Seitan and rootless containers, we want to target rootless containers by reducing the number of capabilities given to them and by impersonating only the necessary syscalls, and a nice outcome is that you can have deep argument introspection — for example you can also check complex objects such as strings, structs and buffers. A nice add-on that we think could be beneficial is counting the number of syscall executions; again, this gives you more fine-grained control over what your process is doing. However, security is not the only use case. We think it could also be used in other contexts, like testing: you could inject some error when a certain syscall is executed — for example if you want to simulate how your application behaves on different errors. You can also mock a syscall, so not actually execute that particular syscall on your system, or maybe another thing could be injecting delays, or simply sleeping and then continuing the syscall. We have deep introspection of arguments, so that could be used for profiling your application — for example it could be an alternative to tracing tools that use ptrace today. We already mentioned it could also be used for managing resource allocation: for example seccomp allows you to inject file descriptors into the target process, so this could be an alternative to SCM_RIGHTS or the use of pidfd_getfd. Or another use case could be connecting to applications that run in a container that doesn't have the bind mount, so the socket is not available to the container: Seitan could take the file descriptors of both applications and connect them. Those are just some examples. Here, finally, you can see an example of the JSON that we were mentioning. You can see that there are two sections. The first is the match — in this case we filter a mknod with major number one and a subset of minors, so we are going to do the privileged operation, the call, only for certain argument values — and then the action that will be performed, which is basically re-doing the mknod in the context of the target; you can see "context mount caller". The second example is what I was explaining to you about testing: in this case we match on connect on two different paths — you can see test1.sock and test2.sock — so if your application tries to connect on test1 we basically simulate the syscall, because we return zero, and in the second case we are returning an error, that is, minus one. So these are just some JSON examples of what you can describe with Seitan.
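As a rough sketch only — the field names below are illustrative and not the actual Seitan recipe schema, and the paths and minor numbers are made up — a recipe pairing matches with actions, as just described, has roughly this shape:

```
[
  { "match":  { "syscall": "connect", "path": "/tmp/test1.sock" },
    "action": { "return": 0 } },
  { "match":  { "syscall": "connect", "path": "/tmp/test2.sock" },
    "action": { "return": -1, "errno": "EACCES" } },
  { "match":  { "syscall": "mknod", "major": 1, "minor": [5, 7, 8] },
    "action": { "call": "mknod", "context": { "mount": "caller" } } }
]
```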
So now, a bit more detail. Right, the cooker generates two parts. We said there is a BPF program, because we need to tell the kernel: please tell us about a number of syscalls — not all of them, otherwise we would have a few problems; I mean, it wouldn't be really useful if we just got all the calls, like read, write or sendmsg, or networking calls in general. So we need to be selective, we just want to get what we are interested in, and this is the role of the BPF program in the kernel. So, great — here you see a binary search tree, which is what libseccomp implements. Why do I have a binary search tree? Because, well, I have a list of syscalls, and maybe they are a bit more than seven, maybe they are 200, and every time BPF needs to check if it's a matching number, that's a comparison, so it's actually quite relevant to keep the average complexity of the search operation to something reasonable. Okay, great, for that we have big O of log n. But there is something that makes libseccomp's job a bit simpler than what Seitan does, because libseccomp is typically used to just deny or accept syscalls; we need to be a bit more detailed, and we need to do accept, deny, or notify. So we have two optimization goals, and one thing that we found to be quite effective in our solution is to fill the bottom-most level, the leaves, with some intermediate jumps — there are redundant jumps to the possible actions that are shown here: the user notification, or let the process do it as if nothing happened, or block it. And this brings me to: what's the overhead, you might wonder. We haven't been really scientific yet, because that would need a bit more time, but essentially what we did is to try around 10 million lseek calls on a ten-ish-years-old laptop, and, well, either the laptop got quite fast or the kernel got quite fast, I don't know, but it just takes seven seconds. Then we tried a typical usage that we see with Seitan, with filters that might make sense for typical Podman containers that need to just mount a volume — so it's 100 instructions that do essentially nothing, plus we have some comparisons, and then we jump across this filter — and that takes a bit longer, 8.2 seconds. From there we estimated that every comparison costs something between 20 and 40 clock cycles per instruction; again, there might be something better on the market now, so I guess we don't care — or we do — but this should show that what we're doing is actually doable. Now, we were talking about the BPF part, and this is the other part: this is the part that is, let's say, digested — sorry, actually consumed — by the user space monitor. The user space monitor gets notifications, and now it needs to decide what to do with them. This is pretty generic — sorry about how obvious this is — but we have an area of instructions, an area which is read-only with the constants that you put in the JSON, and a temporaries area, which is the only read-write part, and that's pretty much it. And we have a structure that's really simple, which is what seccomp gives us: a list of the arguments plus the ID of the target. Looking at the instructions, we try to keep them to a minimum, and we are of course concerned about feature creep, but we are quite committed to not adding more than this, because otherwise we will not really be able to claim it's secure — and it will not actually be secure. So the options are: well, the obvious one, check that the syscall number matches what I wrote in the configuration. A couple of them are specific to seccomp: seccomp allows us to inject a file descriptor atomically — atomically with a call, meaning that the task cannot do anything else meanwhile — and it is useful because, as you'll see later in the demo, you can connect to something and the supervisor connects you to something else and replaces the file descriptor, and this is actually safe to do. We can return an error or success, and then we need to shuffle the data around a bit, because the configuration comes from JSON and the process can pass whatever in it,
and a quick mention about the context. By context we generally mean namespaces on Linux: we also enable specifying the namespace — several types of namespaces — where we want to execute a syscall, so when we impersonate the syscall we want to be able to do that, for example for a container, in its mount namespace; plus obvious boring things such as a working directory, the UID, the GIDs. In this JSON recipe, of course, we need to have references between matches and actions, because we might want to recycle some data. Security — how bad is security with this? It might look like we've been glossing over it so far, but that's not our intention, and actually this comes out of a bit of experience we have with several container engines or virtualization — not virtualization in the sense of Xen, because in that case you have much stronger isolation and you probably wouldn't need to use this at all — but let's say you have something in between, or a mix. And sometimes you look at it and you think: okay, actually I didn't want to tell a component, via an RPC, the path of a file that I want to open — what you need is to open the file — and maybe there is actually a better way. So the obvious benefit of this is that instead of implementing several types of RPCs, which we saw in several projects, you can have a unified mechanism — unified both in the sense that it should be generic enough to be used by different projects, and also in the sense that for the same container or the same engine you have a single place where you just say: okay, those are my set of privileged operations, and nothing else. We don't want to do the parsing in the supervisor, of course; right now it's 500 lines of code, and we really, really hope to keep it that way. There is definitely a significant attack surface, and in that link we listed a few considerations about it, but overall we think — it's not perfect, it doesn't guarantee security by itself, it's not magic, but we think there is some clear value in this solution. Okay, so now we have a live demo — we have a website where you can also find all of this, so if something goes wrong, please go there. Okay, we have seen some examples, now we can see the Seitan setup in action. First of all I would like to show you what we are going to execute: this is similar to the example I listed in the slide, so there are different matches. In the first one we are going to try to connect to a different path — you can see that in the match we have cool.sock, and we're going to modify the connect and connect to demo.sock instead, so a different path, a different server. In the second match we're going to inject an error, permission denied, when we execute the connect, and the third one is to execute the rest of the connects. First of all we need to generate the input file — this is done by the cooker, which takes the recipe as input — then we need to generate the gluten, which is the input for Seitan, and the BPF filter. Okay, some prints, but we have generated the files. What we would like to do is just to read some file and print it on the server, so I'm generating a file that will be read, and then we can start socat as a listening server — and this will be the path we actually want to connect to. And then we need seitan-eater in order to launch our application, and it takes as input the BPF filter and our application, which is again
socat, which is going to open the file that I wrote previously, and we want to connect to the cool socket. We don't actually have this cool socket, but we'll be connected to the demo one. So seitan-eater is blocking because it's waiting — there is some synchronization, because we need to start Seitan first, otherwise we might lose some syscalls. Seitan takes as input the other file that we created before, the gluten, and it takes the PID of the eater. Okay, let's try again — live debugging is always fun — we didn't start the other socket, so let's do it again; I mean, it's live. Okay, so now you can see that this has finished and we have printed the string. Okay, the second part of the demo: we are going to execute the same command but on a different path, and on this path we are going to inject an error. Seitan is always the same, it takes the same gluten file, and I think you can see that we got the permission denied. So this can be a nice way to test different behaviors of your application. Of course, if you are going to use another path — in our case we don't have this socket — the connect is not filtered and it will simply be continued, and in this case it fails because we don't have it. So those are the three matches that we had in the recipe; this was the first demo. In the second one we are going to use Podman and try to create a character device. Here you can see that in the match we have mknod with major one and a subset of minors, and as the action we are going to replicate the mknod in the context of the caller. First of all I want to show you what happens if we don't use Seitan: in this case I drop all the capabilities, so I'm not going to have CAP_MKNOD inside the container — it's a Fedora container — and it just tries to do a mknod, and yeah, I got permission denied, because the container doesn't have the capability. So now we can try to start Seitan — it's a slightly different flavor — oh, I haven't generated the input file, so first of all of course we need to generate it; again this takes the JSON file I showed you previously and generates the gluten and the BPF filter. Okay, now I'm going to start Seitan — it needs root in order to be able to create the mknod device nodes — and it takes as input again the gluten that we generated, but in this case we are not passing the PID but a path to a socket; this is the thing I mentioned before about the OCI integration for the seccomp filter. Okay, so now I will start the same container again, but I have added two annotations — I hope you can see them. The first annotation is the seccomp BPF data, and with it we are passing the BPF filter that we generated with the cooker; the second annotation is the OCI support for the seccomp notifier, and it takes the path of the socket where the runtime will pass the file descriptor. And then, as the command, we again try to create the character device and try to list it, so we will see if it's successful. Okay, you can see that now we have been able to create the character device, because the call was actually performed by Seitan. So in this case we have been able to create a character device even without the mknod capability inside the container.
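A hypothetical reconstruction of that Podman invocation; the annotation names follow the OCI seccomp-notify extensions mentioned above, but the names, paths, and device numbers are all assumptions here, not verbatim from the demo:

```
# Hypothetical sketch -- annotation names, file names and device numbers are assumptions.
podman run --rm --cap-drop=ALL \
  --annotation run.oci.seccomp_bpf_data="$(base64 -w0 bpf.out)" \
  --annotation run.oci.seccomp.receiver=/run/seitan.sock \
  fedora sh -c 'mknod /dev/lol c 1 7 && ls -l /dev/lol'
```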
Okay, so to wrap up: we aim to reduce the capabilities and the privileges given to containers, and what's important is the declarative approach versus the imperative way. You can find more information on our website. Future plans are of course to finish it — right now we handle very few syscalls, but we plan to add more. We would love to have feedback from you, because it's a very new idea, so if you have any concerns, speak up. And our goal is to integrate Seitan with container engines and virtualization engines such as KubeVirt. Special thanks to Andrea and to Christian Brauner, whose work has been very helpful and helped us shape this. Oops, sorry, I hadn't realized — any questions? Yeah — so the question is if it's possible to do a mknod, or whatever, outside of a container. Yeah, sure — the first example was without a container, and you could do exactly the same. So yeah, the program needs to be loaded — you have seen the two flavors of Seitan: in the first example in the demo we used seitan-eater, because we need to launch the process after first installing the filter. Yeah, thanks, thanks for your time. Hello, good afternoon, my name is Lukash and I would like to present a short talk on how we test graphical user interfaces with openQA in Fedora. But this talk is not going to be an informative talk only — it should be an evangelizing talk, and maybe it could inspire you to start using openQA, if you sometimes look for something that can help you test stuff in a GUI. openQA is an automated test tool; it's mainly developed by SUSE, but we also use it in Fedora, and my colleague Adam Williamson helps develop it too. openQA is fully integrated into Fedora — it's packaged for Fedora, so if you run Fedora you can just install it in 15 minutes using the RPM packages. However, this talk is not going to be about installing openQA or setting it up — I talked about that last year at Fedora Hatch and I'll probably talk about it again some day in the future, doing a workshop — this is just aimed at graphical user interface testing. openQA is good because it allows you to test various operating systems, so when I talk about Fedora you could replace Fedora with anything: it could be RHEL, it could be openSUSE, it could be Windows — it's up to you what you put inside the virtual machine, because openQA runs a virtual machine and performs the tests on the content of that virtual machine. Another good thing is that you don't have to install anything into the tested operating system: you can just take the basic operating system, you don't need to add anything to it, you run it in a virtual machine and you can start testing it. So it basically creates and runs a virtual machine — this can be based on an ISO file or a QCOW2 file — and you can perform various actions inside that virtual machine and evaluate the outcome. The architecture is that there is an openQA web application with a database — that's sort of a scheduler; you communicate with it using the browser or a REST API — and then there is a worker that performs all the jobs that it downloads; os-autoinst has the tests, it runs VNC and it uses QEMU to operate the virtual machine. The controller does the visualization, job handling, live viewing, results, and it stores the results in the database, and the worker runs the tests and checks the needles and does all the other things needed to work with the virtual machines. The worker and controller can be on one computer, so you can run it on a laptop for example, but it also can be split, and you can
run the scheduler, the controller, on one machine and run workers on other machines, so you can have a bunch of workers and do multiple tests at the same time — of course you can do that on one machine too, if you have enough RAM. The test itself is a Perl script — although it should be possible to write those tests in Python too; I have never tried, because the test Perl is pretty easy, it's just basic Perl commands and some other commands from openQA — and the test defines what you do inside of that virtual machine. So it mainly defines mouse and keyboard actions, it checks and evaluates the needles — we are going to talk about the needles a little bit later, so you will know what that is — and it evaluates what you say you expect in the test script, so it can be compared to what you get in the virtual machine. And of course the test, or the job, can end with various statuses, such as passed, failed, soft-failed, cancelled, and so on and so on. The tests are Perl modules and they are placed in the tests directory — that's probably not that important, but if you look at an openQA instance, there is a tests directory with the test scripts, and there is also a lib directory with the libraries, with commands that you can use. The routines we are going to talk about now are documented in the test API documentation, which is at open.qa, API, testapi, so you can check it there, because I am not going to talk about it on a deep level — I will just show you what you can do and what routines you can use to test the graphical user interface. Actually, each test has a header where you define what libraries, what files, what packages it will use. You probably know, if you know Perl, the use strict and maybe use warnings — these are useful things to do, because if you omit them, Perl will not tell you about your own mistakes, so troubleshooting a test script becomes rather complicated. And we say that we use the installedtest base, and we use the testapi library and the utils library. In Fedora this header is mostly standardized, and this is, I would say, a typical header for the majority of our graphical user interface tests. Then the test should have a subroutine called run, where all commands must go. You can also use other subroutines in the test scripts, but then these are only visible in the scope of that module and not outside of it — so you can prepare routines of your own and use them in the run subroutine, but outside of the module only the run subroutine will be visible. And then it ends with the test flags. The test flags specify what to do when your test finishes, and by default all are off, so you can switch one on, like this, for example. always_rollback means that after the test finishes it will return back to a saved state, because you can save the virtual machine at some point, then you can perform various activities, and after these activities are finished you come back to the original state — which is a great thing to do when you want to start from the same point all the time. For example, if you have a graphical application and you start using your mouse and clicking on widgets, you are moving the state of that application, and then the next test could start at a certain point: for example, you open the help window in one test because you want to test that the help window can be opened, so you open it, and then
you do another test that uses another widget, but the help window is still opened. You can design it like that — by design, I want to continue when the help window is open — but then sometimes the test fails, the help window will not be opened, and it breaks the consecutive tests too. Therefore it's good to start from a certain point: you can save it and then you can always roll back to it. Other possible flags are fatal, which means that if the test fails then everything fails; milestone, which means that after the test, this is the point where you want to save the state of the virtual machine; no_rollback if you don't want a rollback; or ignore_failure if failures should be ignored. And then each Perl module must return a true value, so right at the end you have to place a 1. When I started with openQA I didn't know that, and I often would omit the 1, and then I was getting errors all the time and I couldn't run it and I didn't know why — and then I realized there must be the 1. Okay, so now you want to test a graphical user interface, which basically means you want to see some widgets and you want to interact with those widgets using your mouse or your keyboard. openQA lets you do the following actions. For mouse control, you can set the mouse to a position with mouse_set — this is good if you know which position you want to set the mouse to: normally you would define the resolution of the screen, the resolution of the virtual machine, in openQA, and you maybe know that the position of the widget is at 500, 600, so you could move the mouse to that position directly; most of the time this is not the case, and I will show you how to fix that. You can hide the mouse cursor — move it out of the screen — because sometimes you don't want that cursor to affect the process of the test. You can click the mouse, you can double-click, you can triple-click; for each of those commands you can select the button to click — the left, the middle or the right button — and how long it should be held, but I don't want to go that deep right now. And you can also do a mouse drag, which means you start at some point, you hold the mouse button, and you drag to another position. Mouse scrolls are currently not supported — I visited the openSUSE booth this morning, and there were some nice guys doing openQA too, and I had a nice chat with them and asked about a problem that I have with a certain needle type, and then Defolos told me: yes, but this is not such a big problem, but if somebody wanted to create a patch, it would be mouse scroll. So they also want mouse scrolls, we want mouse scrolls, but nobody has implemented it yet. There is a workaround we are using in the Fedora tests, and it's a keyboard workaround. So, key events that you can use: you can send a key, which basically means press a key or a combination of keys. Press and hold a key is a good one when you want to interact with the mouse: you can press and hold the key, press the mouse button, and then release both of them — that might be good for some sophisticated applications. Release a key, of course, is for when you're pressing and holding one — releasing a key releases the key. send_key_until_needlematch basically means press the key repeatedly until you find what you expect, which I use when I need to scroll: I send, say, a down arrow or a tab, and the GUI scrolls.
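Putting the header, the run subroutine, the test flags and a couple of these actions together, a minimal Fedora-style test module looks roughly like this — the needle tags are made up for the example:

```
# Minimal sketch of a Fedora openQA test module; needle tags are invented.
use base "installedtest";
use strict;
use testapi;
use utils;

sub run {
    # wait for the application window, then click a button needle
    assert_screen("calculator_app", 30);
    assert_and_click("calc_button_1");
    # keyboard interaction: type an expression and press Enter
    type_string("1+2");
    send_key("ret");
    # scroll workaround: press Down repeatedly until the needle matches
    send_key_until_needlematch("calc_result_3", "down", 10);
}

sub test_flags {
    return {always_rollback => 1};
}

1;
```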
You can also type a string. In Fedora we have wrappers, type safely and type very safely, because in type_string you can specify how fast it should type and so on, and we don't want to repeat that all the time. Type very safely types really slowly, because when you just type a string, we were sometimes getting errors: it typed so quickly that it missed a letter, or it pressed a letter so quickly that it produced three consecutive letters — instead of "password" it produced something like "paaasswoorddd", and this is not what you want. So we type very slowly with the type very safely command. Our type password is similar, but it doesn't get logged in the log files, so that nobody should know about it. And enter a command is basically again typing a string, but this time it adds an Enter press at the end of the command, so you can use this one to run commands. And now we talk about needles, which are a crucial part of GUI testing, because how should the test machine know what you want to see? It's because you can compare a pre-created image with the content of the virtual machine: you have a screenshot with an area defined that you expect to be found, and then openQA compares it with the content of the virtual machine, and if you get a match, it's good; if you don't, it's not that good — unless you wanted it that way. So you have various commands: assert screen means that you want to check that something is there — maybe a widget is present or an icon is visible on the screen. Check screen is similar, but it doesn't fail if nothing is found; instead it gives you an undefined value, so you can use some kind of branching in your code, like: if you find this, do that; if you don't, do something else. Assert screen and click means that if it's found, it clicks on it; assert screen and double-click means that if it's found, it makes a double click. Then, for example, what we also use is click on last match: if you have a check screen, say, and it finds a match, you don't have to write another assert — you just use click on last match and it will pick up that last match and click on it. So you can say: if check screen something, click on last match, and if not, do something else. And sometimes it's also very important to wait until the screen changes, or to wait until the screen stays still, because sometimes openQA is pretty fast — it's much faster than a user is. We had problems with KDE, for example, because KDE used — how do you say that, the word slips my mind — animations, yes, that's the word — and those animations took some time, but openQA was so fast that it clicked somewhere even before the animation had ended, and the test failed. So you say wait until the screen stays still, and then KDE can do its animations, and the test waits and then finds the needle it should. There are some more: you can record a soft failure, you can record information, and you can save a screenshot — although saving a screenshot didn't work for me, I don't know why, I must figure it out, but it should be there. You can also provide variables, either in the test settings or when you start the tests: you define a variable and then you can use it with get variable, set variable or check variable. So for example you can say in the test settings that the tag, say, is gnome, and then you can have one test script that can operate on GNOME and KDE, and it knows that the variable is set to gnome and it will only do the GNOME part on that particular run.
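The check-screen branching, the wait for animations, and the variable check just described could be combined like this — the needle tags and the variable name are invented for the example:

```
# Sketch only; "calc_help_window" and the DESKTOP variable are examples.
if (check_screen("calc_help_window", 10)) {
    click_lastmatch;                          # reuse the match we already found
}
else {
    record_soft_failure("help window did not open");
}
wait_still_screen(3);                         # let desktop animations settle
if (check_var("DESKTOP", "gnome")) {
    assert_and_click("gnome_only_widget");    # only done on the GNOME run
}
```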
And if you replace the variable when you start the script, then it will behave the opposite way and only run the KDE parts. So what is a needle, how does it look? Basically, this is a screen with a graphical application — this is the calculator, the GNOME calculator — and I took the screenshot from a real-life test, so this is what we expect a user might see. But comparing the entire screen is complicated — there are so many problems, maybe like these little tiny dots — so we don't want to compare the entire screenshot, we just want to check if there is the button with the number one. So from this screenshot we define an area that openQA should try to find, which is the button one, and now it only looks for this particular part. Theoretically this part could be found at any position on the screenshot — it doesn't have to be just here — so if, for example, the graphical application is moved to the right part of the screen, it will still match. When we think about needles: a needle consists of two files, the PNG screenshot and a JSON description file. A needle must have at least one area; if it should be a clickable needle, then it should have just one area, and the click point lies in the middle of that area. The needle also needs a tag, because openQA looks for needles according to the tag, and you can have more needles with the same tag. The bigger the area, the higher the risk of a mismatch, and you can also set the needle fuzziness, which can be adjusted from 0 to 100 — with 0 probably everything is a match, and with 100 only a totally exact match counts. By default it's set to 96, and we sometimes go down to 90. This is the JSON file, and it looks like this: there's the tag to find the needle, the xpos and ypos are the coordinates where the area starts, then width and height are the width and height of the area, the type is match — which means it must find a graphical match — and the match fuzziness is 90. The need for needles: sometimes you need to make more needles to cover one tag, because fonts differ, colors might differ, and other stuff might differ — so for example, if you want to check the buttons on that calculator, you would need at least 24 needles. As an example of how you would click on those buttons, the 3 × (3 + 2) = sequence could be solved with these commands: it's basically assert and click, assert and click, assert and click, assert and click, then compare the result, and that's it, and then delete it with Escape, for example. And you see that this would require eight needles to do just this.
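A needle JSON file of the kind just described looks roughly like this — the coordinates and the tag are illustrative:

```
{
  "area": [
    {
      "xpos": 100,
      "ypos": 200,
      "width": 44,
      "height": 32,
      "type": "match",
      "match": 90
    }
  ],
  "tags": ["calc_button_1"]
}
```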
So, since we have some more time, I would like to show you the real test. This is a real calculator test that does various clicking and tries various examples, or equations; it also shows the help and the About window. If you take a look at the code, it's slightly more complicated, because we were trying to use a subroutine that does the calculation, so that we don't have to repeat all the needles all the time. And the last thing I would like to show you is the video — I hope I can... how can I make it slow speed... yes, half the speed. So now it logs into the workstation, it starts the graphical application called Calculator, it shows the About, and now it clicks and solves the equations. This is designed so that to solve those equations it must click on all of the buttons — so we know that all of the buttons work, and we also know that the help worked. So this is a very short graphical test of the GNOME calculator, and this is everything I wanted to show you today. It's so easy, you can start right away — I measured it, the installation and setting up of openQA takes 15 minutes, 30 minutes when you don't know it yet. Okay, do you have questions? The good thing is — can I repeat the question: how do we test GNOME with Wayland, or KDE with Wayland? The good point about openQA is that it doesn't care about Wayland at all, because it only compares the screens that the virtual machine is giving it, so whether it runs Wayland or Xorg doesn't really matter — you can test Wayland just fine; you can test pretty much everything that can be run in a virtual machine. Ah, fun — yeah, you cannot do it, because it's not implemented, so there is no way you could tell openQA: now scroll the button. When I talked with the openSUSE guys, they told me that there is probably a routine which communicates with VNC and sends scrolls, but you can't use it right now — maybe we'll find some time and a way to do it. Yes — if you want to, you can contact me any time and I can help you and show you how to do it. Yeah, of course, we have a repo where we store the tests — os-autoinst-distri-fedora, it's on Pagure — and basically you see that there are the tests here; we have applications, and each graphical application has a dedicated directory where you put all the tests you want to run. I think it's useful to split the entire application into various tiny steps and start from the beginning again — always rollback, always rollback — because then you know: these tests passed, this test failed, so let's investigate this one more. If it's just one entire script and something fails at the beginning, then the rest never gets tested. Yes, of course, no problem with that. Yes, but because we don't want to — okay, do we use coordinates only to find the place to click? Yes and no: basically the engine operates on coordinates, but — I would like to show this one — I could expect that the digit number one would be at maybe 800, 200 or something, which may or may not be the case; therefore there is the needle system, where you say that you want to look for this particular area — a little bit gray with a one on it — and openQA compares it with what it has in the virtual machine, calculates the coordinates automatically, and clicks in the middle of that area. So it calculates the coordinates for you, but if you want to specifically say it needs to be coordinate x and coordinate y, then you use mouse_set and it moves the mouse to the exact coordinates. Okay, enough — yeah, time's up — so thank you very much for your attention and have a great DevConf, thank you. Okay, we're good to go. Hello everyone, welcome, thank you for coming, and thank you to the organizers for letting me speak, for giving me the slot. My name is Akilea, I'm an engineer at Red Hat on the image builder team, and today I want to talk to you — not so much about how we build images, but about how we think about defining and configuring the images, and letting users sort of invent the universe of image definitions, let's say. One small thing I'd like to say — I don't know if this is a good idea, but I'm going to be using some terms that we use internally, and with
these things it's always a little hard to know what other people are familiar with and what not — we're trying to explain everything, of course — so if I use a term that's unfamiliar more than a couple of times, raise your hand and we'll see if we can keep the ball rolling, so that I don't lose my audience. So yeah, let's get the show on the road. A quick overview of what I'm going to talk about — I gave you a little bit initially. So, we build images, and it turns out that building images is pretty easy, but making sure they boot and that they're useful is a tiny bit harder. The trick is to restrict users, and the code itself, so that it's hard to build invalid configurations or unusable things — but we also want to give users the power to explore the space of what kind of images you can build, and these two things are kind of in opposition to each other, right. So you can imagine a solution — a project — where you can only build five kinds of images, or even just one; that would be really easy to test: you can't do anything else except build that one image, it always builds, it always boots, it's fine — it's just not a very useful project. On the other hand, you can imagine a situation where you can build almost anything, but most configurations wouldn't work. So we need to find the sweet spot there, and guide users and ourselves to only build things that make sense, while being powerful enough to be useful. And so what I want to talk about is our solution, or our way of thinking, where we define abstractions that describe images as configurations of components, and these components inform both the way we implement things in code — the way we define our own library — and how we want to present these configurations to the user, so that they know what we're doing. So, part one: let's have a look at image builder. Image builder builds images, and let's have a look at that — first a bit of an overview of what it looks like to use it, and I'll talk a bit, like I said, about how we build images, but most of the talk is about how we define them, and how we think about what an image type or an image configuration is. This is what it looks like when you want to build an image: this is image builder running on console.redhat.com, and this screenshot shows the first step of the image building wizard. This is where the user selects the target platform, which defines the kind of image that we're going to build — in this case you can see Amazon Web Services selected, so we're going to build an AMI-type image and make sure it works with Amazon Web Services. By the way, image builder also works on premises — it's not just the Red Hat hosted product — you can also run it on your laptop and build images from the command line or from composer. This is a follow-up step, sort of halfway through the image creation process, where users can select additional packages to add to their image — in this case the user selected nginx, so presumably maybe they want to run a web server. And what I'd like to do is explain how these options, and others like them — selecting packages or selecting the target platform — affect the process of creating the image, and how we might be thinking about these configurations in the future. So at the core of image builder, sort of the bottom layer of our
stack, is osbuild, and osbuild is a command-line utility that takes in a manifest and returns one or more file system trees, and it does most of the work — well, it does the actual work — of building the image. A manifest is a giant — well, a rather big — blob of JSON that describes a series of pipelines and steps. You can see here there's an RPM stage, there's a kernel command-line stage, a hostname stage — these are all steps that modify the file system tree in very specific ways, to create a file system tree that resembles the image we want to boot. osbuild has no knowledge of distributions or workloads or anything like that; it quite simply, and very stupidly, executes stages as described in this manifest and returns whatever the result is at the end of the process. This is a simplified look at what the JSON object looks like — I just pulled out the names of the pipelines and the stages — and you can see that each pipeline is broken down into a series of stages, and most of these stages have pretty self-explanatory names: if it looks like the name of a shell command, that's probably what it's calling. So for example the RPM stage installs RPM packages into a tree, the cloud-init stage configures cloud-init, a systemd stage enables services, you can configure the bootloader, install the bootloader, configure nginx. And the nice thing about stages is that they define their inputs rather strictly; they don't always expose the full breadth of what you can do with each thing, but they modify the tree, in a kind of type-safe way, to create a bootable image. So on its own, like I said, osbuild doesn't really have any concept of what a distribution is, or what it's actually doing in the grand scheme of things, so it isn't very useful on its own, and no one is expected to write manifests by hand, or even to understand the individual stages that we define. So we provide a library that holds the domain knowledge of what a bootable image, or a specific distribution, looks like, and the library knows how to create a manifest that will accomplish that; and within this library is where we define the base images that we present to the user, like the AWS image that we saw in the earlier slide. So, right — repeating again what I'm trying to say here: osbuild makes no guarantees about what it's going to produce for any given manifest, in the sense that a manifest doesn't guarantee anything useful or bootable. osbuild-composer, on the other hand, needs to produce images — it needs to produce something that can be built and that matches what the user expected, and ideally it should be useful. And the easy way to guarantee this, like I said in the beginning, is to restrict user choice: the smaller the configuration space, the fewer things can go wrong, and this is sort of what we have right now. So when we talk about image types, like the AWS image type, we refer to an image, or an archive, that contains an operating system tree, and it's a predefined configuration that matches a distribution, a platform and an environment — kind of like this triple configuration — and additionally the user can add their own little user customizations, like we saw in the example where the user can add nginx to the image.
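To make the earlier manifest description a bit more concrete, a heavily trimmed sketch of such a manifest might look like this — the rpm stage inputs are omitted and the option values shown are only illustrative, not taken from a real build:

```
{
  "version": "2",
  "pipelines": [
    {
      "name": "os",
      "stages": [
        { "type": "org.osbuild.rpm" },
        { "type": "org.osbuild.hostname",
          "options": { "hostname": "localhost.localdomain" } },
        { "type": "org.osbuild.kernel-cmdline",
          "options": { "kernel_opts": "ro" } },
        { "type": "org.osbuild.systemd",
          "options": { "enabled_services": ["cloud-init.service", "nginx.service"] } }
      ]
    }
  ]
}
```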
These user customizations are basically the only real control the user has to affect what will come out at the end — to apply their own preferences to what they want from the image. So for example you can build a Fedora 38 image that can run on x86 in AWS and then add nginx to it, or you can build something like a RHEL 9 image that runs on ARM in Azure and add a 20-gigabyte /opt partition. And, to repeat myself a bit: in code, as well as in the user interface to an extent, it's these first three components that we use to define the image types, and we do that in a rather static manner. So for example, if we never explicitly added an image type that is Fedora 38 on ARM on AWS, then that essentially doesn't exist in our code: even though it's a perfectly valid configuration, we don't support it, it doesn't exist as far as we're concerned. And each choice of these components — the distribution, the platform (which is the hardware architecture), and the environment it's going to run in — has very specific effects on the image building process: they're associated with different configurations of the stages that we saw earlier. So for example the distribution defines the base packages that you're going to need and the repositories that you're going to use to download them; the platform specifies a subset of those repositories and some additional packages, for example for the bootloader or the firmware; and then the environment, which in our example was AWS, adds additional packages and configurations to make the image run there, in that environment. And we would like to move away from this static configuration situation that I described, where everything is defined as static combinations of these three components, and towards defining the effects that the choices of these components have on the image building process — and if we have a well-defined set of what these things do, then we can expand the configuration matrix that's valid. So this is the part of the talk where I describe things that don't yet entirely exist — this is where I imagine that we're moving with the project, and some things might be more conceptual and might change in the near future. What we want to do is define the choices we saw earlier as abstractions in our code — well, in our code but also conceptually — and have them help us reason about what an image definition is and how we present these choices to the user. So we can go back to this breakdown of the components, where we have the distribution, the platform and the environment — before, we used these in a static way to define what an image type is — and then a workload, which sort of defines what the image is intended for. The distribution: as before, I think we all know what a distribution is. The platform, which is the hardware architecture, but maybe more generally could be a specific device as well. The environment, like we saw before, could be a cloud environment like Amazon, Azure or Google Cloud, or it could be a bare-metal environment.
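As a purely illustrative summary — not an actual configuration format — one selection per component, together with the effects just described for each, could be written down like this:

```
{
  "distribution": { "choice": "fedora-38",
                    "adds": ["Fedora 38 repositories", "core package group", "F38 build root"] },
  "platform":     { "choice": "x86_64",
                    "adds": ["x86_64 repositories", "bootloader packages", "bootloader install/config stages"] },
  "environment":  { "choice": "aws",
                    "adds": ["cloud-server group", "cloud-init package", "systemd stage enabling cloud-init"] },
  "workload":     { "choice": "webserver",
                    "adds": ["nginx package", "systemd enable nginx", "nginx config stage"] }
}
```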
You can think of the workload as the kinds of things you would normally do at provisioning time. We can go through an example of how selecting each of these components affects which stages we run and how they shape the image building process. You pick a distribution, say Fedora 38: that selects the Fedora 38 repositories — if you're going to build a Fedora 38 image, you need the Fedora 38 repositories — you start with the base package set, which is the core package group, and you need the Fedora 38 build environment. That's all we know at this point. Then you add on top of that that you'd like an x86 image: you restrict the repositories to the x86 ones, so you don't download anything else, you add for example the bootloader packages for x86, and at the same time you need the stages in the build process that install and configure the bootloader. At the next step, the environment: if you select AWS, it would be good to have the cloud-server package group and cloud-init, which is how images get provisioned or configured at first boot through the AWS cloud console, and since we're installing cloud-init we also enable the cloud-init service using the systemd stage. Finally you might select a workload like a web server, which adds an extra package — nginx in this case — and since we're adding nginx we also want to enable and configure it, so that's an extra option on the systemd stage in the manifest plus a stage to configure nginx. If the effect of each of these components on the image creation is well defined, we can freely combine them and we don't have to stay in this world of statically defined configurations: instead of explicitly defining the valid combinations, we can restrict only the combinations we know are invalid and let the rest of the space exist; we can expose the entire configuration matrix in code, and to the user, and know that any combination is possible. So we're now thinking about what choices we should give users and how those choices map to the components I've been describing. The current state, like I said, is that you have this static list of triples — distribution, platform, environment — and then you give the user a few knobs to add some packages or tweak some configuration. These static configurations are exposed to the user, but not really: if you're building for a given distribution, you're limited to a handful of image types that we've already defined, and the identity of the image is essentially that static configuration, with the user just tweaking a few things on top.
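For a sense of what those knobs look like in practice today, the on-premise flow is roughly: write a blueprint, push it, and start a compose for one of the predefined image types. This is only a rough sketch — the blueprint fields and the "ami" image type name are how I understand the osbuild-composer documentation, and the exact key for the partition size varies between Composer versions:

    cat > webserver.toml <<'EOF'
    # the "workload"-ish part: extra packages and customizations on top of a predefined type
    name = "webserver"
    description = "Fedora web server image with nginx and a large /opt"
    version = "0.0.1"

    [[packages]]
    name = "nginx"

    [[customizations.filesystem]]
    mountpoint = "/opt"
    size = "20 GiB"        # may be called "minsize" in newer blueprint formats
    EOF

    composer-cli blueprints push webserver.toml
    # distribution comes from the build host; platform + environment are baked into the image type
    composer-cli compose start webserver ami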
Our new setup, on the other hand, puts equal weight on each of the components, so our question is: why don't we just expose all the components to the user and let them freely snap them together and create what they need? What that would enable, both for ourselves and for users, is to simply select each of these components; what we do in turn is abstract away the meaning of each component in terms of the image build process, but not the meaning of what ends up in the final image. And my question here — the reason I'm talking to you all about this today — is basically to ask: does this sound like a good idea? This is how we're thinking about things; is there something we've missed, a case that wouldn't be covered by these components? Does it sound like a good idea to be able to take an image like the one we saw, swap out just the environment instead of AWS, and have the same image for the same workload able to run on a different cloud service and be optimized for it? That's pretty much it — thank you. Questions and comments are of course welcome, and this is the website for our project and the GitHub repository for our organization.
So the question was what's the motivation in general, and I guess the extension to that was how it differs from getting a base image and provisioning and modifying it. There are a couple of answers to that. First, someone needs to build the base image, and there are a lot of alternative projects that do exactly what you said: take the base Fedora image, in our example, and add the bits to it. There are a couple of things this project solves that we think are worthwhile. First of all, every time you build an image you get fresh content: if you're building an image today to deploy today, you get the up-to-date packages, instead of maybe taking an image that was built a week or a month ago and needing to update it. There's also the idea of building an image for a purpose: there are things that are either impossible, or better done, or must be done at build time instead of at provisioning time. For example — I think it's OpenSCAP we use — we have hardened images: you can build a hardened image with certain configurations, and taking an existing image, booting it and then flipping those configurations isn't considered compliant with certain security standards. I'm sure there are other reasons and use cases; everyone in the front row can probably answer your question in a different way. Partitioning is another good example: we can partition during build time and create almost any partition layout the user wants, whereas if you just take an image and deploy it to AWS you can add some partitions after the fact, but if you take a RHEL image and deploy it to AWS you probably can't create a separate user partition after booting it.
So there are a lot of reasons you might want to configure something before you even build it. The follow-up question was how long a build takes and whether it's provided as a service. Yes, it is provided as a hosted service on console.redhat.com now, for all Red Hat customers, and the build time depends on the image: it can be as low as five minutes, up to maybe ten or fifteen. The cloud images usually take between five and seven minutes, and we can also deploy them for you directly to the cloud environment. We can also build ISOs, and those usually take a little longer. And to repeat, there's also the on-premise version that you can run: if you have an RPM-based operating system, you can dnf install osbuild-composer, have it running, and build your images locally.
The next question was how this model works for situations where you might not know the workload and just want to build an image. Well, the workload is essentially optional; it doesn't need to be a predefined workload. If we map the old way of doing things onto this new way of thinking, we would of course have a custom workload, or a null workload, where you don't get any extra packages or extra configuration — you just get the base system. Obviously you can't skip the selection of distribution and platform, that wouldn't make sense; for the environment you could have a base environment that doesn't belong to any cloud — think of the default as a bare-metal installation, for example. So there are sensible defaults for the environment and the workload that we can think about. The follow-up question was how custom software would be layered in. That depends — right now we already have the capability of letting the user define their own content sources: if you have an internal RPM repository in your company, or a third-party public one, you can just define it, pull content from there, and it will be installed. We can embed containers at build time as well, and for the on-premise version you can also inject custom files at build time — you just write a file and it gets injected into the tree. Any other ideas — we're open to suggestions about how to get custom software in. Do we integrate with Ansible? Short answer no, long answer also no — though Ansible can integrate with us, which is more or less what osbuild is there for. How much time do we have? Because this seems like a discussion for over there, and I think there were a couple more questions if we have time — there was one over there before.
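Picking up the on-premise option mentioned a moment ago — getting a local Composer running looks roughly like this; the package names are the commonly documented ones and may differ slightly between distributions:

    sudo dnf install osbuild-composer composer-cli   # composer-cli may be provided by the weldr-client package
    sudo systemctl enable --now osbuild-composer.socket
    composer-cli status show                          # quick check that the API socket answers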
Sorry, could you start again, I didn't hear anything. Ah, of course: the question is whether we consider other platforms — VirtualBox, for example. We don't have VirtualBox as such, but we have formats that are essentially the same, and that's really the point: it's not just about producing the correct file format, it's also about knowing what that virtualization platform needs. We're always expanding, always adding more stuff, and that's part of the reason we're rethinking how we put these things together. If, for example, we said let's do Vagrant, today we would figure out Vagrant for RHEL 9, and then it would effectively be "delivered for RHEL 9 on x86", and we'd have to go through that again and redefine it for every configuration. What we want instead is to say: what does Vagrant need? It needs these things; we define them in a well-defined way, including how they interact with the other components, and then you can drop it into any configuration and it should work. The next question is: if you want to do something that isn't covered by osbuild, do you need to write your own osbuild stage? Yes, that's essentially the way to do it — if there's some configuration, or something that needs to run at build time, that the osbuild library doesn't cover with a stage, the answer is to write a stage. But the best answer is to write a stage and send it upstream, please, because we need them. The last question was: we have the manifest, we have the UI and so on, but what if I want to version control my configuration, my image type, whatever I'm doing? The good news is that the way you build an image on premises is with what we call a blueprint, which is a TOML file — I'm out of time, but briefly: it's a text file, you can version it. We also support versioning ourselves: the way it works now, which might change soon, is that you write a TOML blueprint that configures the image and push it into Composer, and Composer itself has a versioning system — that might be going away, but they're text files, so you can version them yourself. Even in Cockpit Composer, which is the web UI you run on premises, you can click through and configure the image, then export that configuration as a TOML file and save it wherever you like.
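To make that version-control answer concrete, a minimal sketch of the round trip might be (the blueprint name is just an example):

    composer-cli blueprints save webserver      # writes webserver.toml into the current directory
    git add webserver.toml && git commit -m "image definition for the web server"
    # later, or on another machine:
    composer-cli blueprints push webserver.toml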
I'm out of time — thank you all very much.
Hi, my name is Clemens Lang, and today we need to talk about your use of root privileges in containers. Let me start with a bit about me. I studied computer science in southern Germany, got interested in open source projects in 2011 while doing Google Summer of Code with the MacPorts project — I'm using a Mac here — and after that landed a job at BMW doing infotainment. That's where I spent seven years building infotainment for cars, first doing software integration work, then packaging, and this is where the connection to today's talk comes in: we wrote a thousand-line, single-binary container runtime in C to simplify building software for the platform, and I got very familiar with the various namespaces involved in doing that. Then I switched over to software updates for the entire car — I think we were the second in the market after Tesla to do that for the entire car — and for the last few years I did security at BMW: secure boot and all that, Linux, whatever you have. That eventually brought me to Red Hat, where I now work in the crypto team. So, first note: I don't do anything with containers in my day-to-day work, so this is a talk from an outsider. What I'm doing at the moment is patching OpenSSL, trying to get rid of SHA-1 where I can — that might have fallen on some of your feet recently, and one of the culprits was probably me — and I'm also dealing with FIPS certification, because that's a fun topic for everybody.
So, root and containers. Before there were containers, there was some conventional wisdom we always applied: if you run a service, create a separate user for it, just for separation purposes. Then Docker came along and we started running everything as root again, because that's just what Docker did. Nobody thought this was a problem, and honestly we still don't think it's a problem, and it probably also isn't that big of a problem — so just to put the entire talk into perspective, I brought an example. Let's assume I create a directory in /var/lib and bind mount it into a container that is running as root — I'm running the entire command as root — and inside that container I happen to copy /bin/cat into that bind-mounted directory and then run a chmod to set the setuid bit on it. That means anybody who runs that binary now gets effective root permissions, and that also works outside of the container: if I then, outside the container, switch to the user nobody and run that particular binary with an argument of /etc/shadow, I get the contents of that file, even though the nobody user should normally not be allowed to read it. This isn't really a surprise to anybody, and it's not really a security vulnerability as such, because if you bind mount stuff in, you should know what you're doing — so this isn't a surprise.
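A minimal reproduction of that bind-mount example might look like this — image name and paths are arbitrary, and on an SELinux system you may need the :z volume option or some equivalent:

    # as root on the host, i.e. rootful podman
    mkdir -p /var/lib/demo
    podman run --rm -v /var/lib/demo:/demo:z fedora:38 \
        sh -c 'cp /bin/cat /demo/cat && chmod u+s /demo/cat'
    ls -l /var/lib/demo/cat        # owned by root, setuid bit set, visible on the host

    # now as an unprivileged user on the host
    su -s /bin/sh nobody -c '/var/lib/demo/cat /etc/shadow'   # prints the file despite not being root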
But if we were running this with rootless Podman, it wouldn't be an issue: you would still get the setuid binary, but it would be setuid for that particular unprivileged user, which might not have the privileges to do anything else. So we could improve things by running the container with rootless Podman. Running things as rootless Podman brings us to the issue of networking, because if you want to offer a service you probably want networking in your container, and in rootless Podman there are basically two options. One is slirp4netns, which works using a tap device but has a couple of limitations: containers can only communicate with each other via exposed ports on the exact same host, or over the localhost interface, and all requests you receive appear to originate from the IP address associated with the tap device, so you lose the information about where a request actually came from — information you might want in order to filter which IP networks you offer a particular service to. Then there's an improvement on top of that, where you essentially take slirp4netns, add a user namespace that owns a network namespace, and do ordinary container networking inside it with netavark. That's great because it allows standard networking between containers, but it still uses the slirp4netns tap device, so all requests still originate from the same IP address. That isn't really ideal if what we want to do is run a service. So the question I was asking really turns into: can we run each container as a separate rootless Podman user, but with "proper" networking — proper in quotes, as in rootful networking? That's the question I want to answer today, and now you see my motivation and the outline of the talk: we have to cover a bit of theory, but it will be quick, I promise, and then I'll outline the various solutions I found by following a mailing list post and a presentation somewhere in the Podman community from 2021 — which was recently removed from the web server, boo, so I had to go to archive.org to get it.
Right, let's get into it. Some theory: why is it that inside the container we can even read a file that's owned by root? To know that, we need to understand how user namespaces work. User namespaces basically give you a separate UID range, and that range is mapped from inside the container to outside the container using a mapping file. You can look at this mapping for pretty much every container that uses user namespaces, and it essentially says: UID 0 in the container is UID 1000 outside the container, and repeat that for the following number of UIDs — in this case one. After that, UID 1 in the container maps to UID 524288 or so for the next 65,536 or so UIDs; that's what it typically looks like on a modern system, where we have subordinate UIDs. The rule for accessing files — and I looked this up in the kernel — is that a container can't access inodes owned by UIDs and GIDs that are not mapped inside its user namespace. So if you don't map UID 0, nobody in the container can access root's files; it's as simple as that. That's theory part one. Theory part two is about networking. For that you need to know that any non-user namespace is owned by a particular user namespace, and if you want to perform an operation inside that namespace, you need the required capability in the owning user namespace. That sounds complicated, but it gets a lot easier with an example. Managing a network connection requires CAP_NET_ADMIN — that's the capability the Linux kernel checks when you try to modify the IP address of a device, for example. If you do this in a network namespace owned by a user namespace in which you are root, then you usually have that capability and the action is allowed. However, changing the host's network namespace requires CAP_NET_ADMIN in the host's user namespace — the initial, root user namespace that you typically don't see or configure, but that exists. And that tells us something: if we want to use the real networking, at some point we will need CAP_NET_ADMIN in the initial user namespace, so there's no way of doing any of this without using actual root for some pieces.
My first idea when I was trying to do this was: okay, I'll start a rootful container, but Podman offers this --uidmap flag that lets me configure which UID mapping I actually want, so let's just not map the host's root user into the container, and the problem I was trying to solve would be solved. That's what these lines here do: essentially I'm saying UID 0 in the container maps to the user I currently am outside the container, and the second --uidmap line does the same thing with subordinate UIDs — we can ignore that for now, we don't need it to understand the concept I'm going for.
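A sketch of that idea — rootful Podman, but with host root deliberately left out of the mapping; the host UID 1000 and the 100000+65536 subordinate range are placeholders for your own user and subuid allocation:

    sudo podman run --rm \
        --uidmap 0:1000:1       --gidmap 0:1000:1 \
        --uidmap 1:100000:65536 --gidmap 1:100000:65536 \
        fedora:38 cat /proc/self/uid_map
    # inside the container this prints something like:
    #          0       1000          1
    #          1     100000      65536
    # so files created as container root land on the host owned by UID 1000, not by host root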
When I prepared this talk I re-ran this command, and my heart almost stopped, because I thought: wait, this didn't use to work — this is the error message I used to get when I did this. It turns out this is fixed in Podman 4.5, so this is what you should be doing; thanks for coming to my talk, goodbye. Well — now this works, but I'm still going to tell you what I did before it worked, because you might learn a thing or two, and you might still decide not to use this particular solution; we'll get to that. So I did some googling and I found a presentation that essentially said: you can run these two commands, then set up the network manually, and that should work — with a link to a mailing list post from a guy who probably did that at some point in time. I clicked it and thought, okay, this looks nice, I can probably do this. The idea is that all of this runs as the user, so we're running rootless Podman. I create the container — create, note, not run or start or any of that — without networking, and I give it a name because we'll need that name for the commands that follow. Right after that I run podman container init, which is also an interesting command: it sets up all the namespaces but doesn't actually start anything inside your container, so the namespaces are available and you have time to modify them — for example to configure the network, which I did here in a script, let's call it magic.sh. After that I run podman start, and my container starts as normal. So the question really is what's inside this magic.sh — and it's this, and it's kind of a lot. Let's go through what it does. I use podman inspect to figure out the PID of the container that we initialized but didn't start. The next line, the sudo ln, is really just to give the network namespace a name that the ip utility from the iproute2 package can use. And then I do what Podman internally also does: set up a virtual ethernet pair, move one side of that pair into the container, rename it inside the container to eth0, which is what we'd expect the interface to be called, bring it up on both sides and configure an IP address, so we can have communication going on. That is a lot of work, and note that we had to do manual IP configuration — I had to choose an IP address and set it — podman inspect won't know about this because we didn't use Podman to configure the network, we must repeat this every time we start the container, and I haven't dealt at all with exposing ports, which requires writing firewall rules, which I really don't want to do manually. This is tedious work; I could probably write a script, but at this point I was about to give up, thinking I don't want to deal with firewall rules and exposing ports, let's just run everything as root. And then I thought: how does Podman actually do this, where is the code in Podman that creates all these network interfaces, chooses an IP address and assigns all of this? It turns out there isn't any, because what Podman does is call netavark and just pipe a bunch of JSON into it, and netavark takes care of configuring the network interfaces, moving them into the right namespace and so on. So I thought: can I just create this JSON configuration myself, pipe it to netavark, and have it do what I want for me? And it turns out yes, I can.
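For reference, a magic.sh along the lines just described — the manual approach — might look roughly like this; interface names and addresses are invented, and host-side routing, NAT and firewall rules are left out entirely:

    #!/bin/sh
    CTR=rootless
    PID=$(podman inspect --format '{{.State.Pid}}' "$CTR")

    # give the container's network namespace a name that iproute2 can address
    sudo mkdir -p /run/netns
    sudo ln -sf "/proc/$PID/ns/net" "/run/netns/$CTR"

    # veth pair: one end stays on the host, the other becomes eth0 inside the container
    sudo ip link add veth-host type veth peer name veth-ctr
    sudo ip link set veth-ctr netns "$CTR"
    sudo ip -n "$CTR" link set veth-ctr name eth0
    sudo ip -n "$CTR" addr add 10.88.99.2/24 dev eth0
    sudo ip -n "$CTR" link set lo up
    sudo ip -n "$CTR" link set eth0 up
    sudo ip addr add 10.88.99.1/24 dev veth-host
    sudo ip link set veth-host up
    sudo ip -n "$CTR" route add default via 10.88.99.1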
So I created a new Podman network — you can make it IPv6 if you want to; I mean, we want to get rid of legacy IP, so you should, these days. Then I generate the required JSON, which contains the IP address I want to give the container — so we still have that problem — and the exposed ports, and pipe that to netavark setup together with a path that identifies the network namespace in which it should do all this. In return it gives you a nameserver configuration that you should somehow get into the container; I was lazy, I just wrote it to /etc/resolv.conf. What does that JSON structure look like? Unfortunately it's not documented, at least not in the netavark READMEs or documentation, and also not in the Podman documentation, so I reverse engineered it, and this is the rundown: it says, I'm looking at this container ID with this container name; here's a list of the port mappings — which container port to which host port, how many ports, which protocol; it gives a list of the networks you want to attach to and which IPs you want inside each network; it also gives you DNS, so you get name resolution and service discovery if you also specify network aliases; and then there's a block called network info at the bottom that I omitted here, because it's really just the output of podman network inspect.
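Based on that reverse-engineered description, the options JSON and the netavark call might look roughly like the following — the field names are my reading of what the talk describes, they are not documented and may well differ between netavark versions, and the netavark binary usually lives under /usr/libexec/podman and needs root:

    PID=$(podman inspect --format '{{.State.Pid}}' rootless)
    cat <<EOF | sudo /usr/libexec/podman/netavark setup "/proc/$PID/ns/net"
    {
      "container_id":   "<id from the cidfile>",
      "container_name": "rootless",
      "port_mappings": [
        { "host_ip": "", "host_port": 80, "container_port": 8080, "protocol": "tcp", "range": 1 }
      ],
      "networks": {
        "rootful0": { "interface_name": "eth0", "static_ips": ["10.89.0.11"], "aliases": ["rootless"] }
      },
      "network_info": {
        "rootful0": { "comment": "the output of 'podman network inspect rootful0' goes here" }
      }
    }
    EOF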
And this is the point where I show you that it works — let's pray to the demo gods, because I'm running this over Wi-Fi, and if it doesn't work I have a TTY recording of me showing it. I have two shells here on a Fedora 38 system — can you still hear me while I'm sitting? great — one is root and one is the test user, and I said we want to start out as the test user, so let's do that. I define a runtime directory — we'll call our container "rootless", and for that reason I'm choosing that particular runtime directory — it probably doesn't exist yet, so I create it, and then I run podman create. The cidfile I'm specifying here is because I automated all of this and the automation expects the ID of the container at that particular path, so that's why I'm creating it; I'm disabling networking, giving it a name, we're starting Fedora 38, and just for the demo we're starting a Python web server on port 8080. So this was created; now I need to run the podman container init command we saw earlier — podman container init plus the name of the container — and now we can see in podman ps -a that the container is in status "initialized"; it's not running yet. Let's get the PID for it so we can see what's going on. I scripted some of this, and this is the part where we need to start running things as root: as root I have a script here that lets me set this up. I need to use the exact same runtime directory that I used as the test user — let me copy it and fix it, because obviously id -u returns something else for root. Then I run setup, I specify a name that's used to generate an IP address (in this case I just use the container name again), a secret that I use so the IP address isn't predictable, "test" as the user I'm running the container under, and I want to attach this to the rootful0 network — let me quickly check that this network actually exists, so we don't get an error. It exists. And I want to publish a port, 80 to 8080 — 80 outside the container, 8080 inside — and I could also specify a network alias here, but I'm not going to show name resolution anyway, so let's skip it. ...I mistyped something; this exists, and the container ID exists in there; there must be a typo I don't see — no, that's not the issue. At this point, this is where I stop and just show you the recording, because I don't have time to debug this right now; you'll see the same thing, and now I'm sure it will work — and the advantage is that I have time and don't have to type, so I can tell you what's happening. We already saw this: I create the runtime directory, then create the container without network, again giving it a name, starting the exact same Fedora 38 container; the container is created, then initialized, and again we see the status "initialized". Here I'm getting the PID of that container, because in this case I'm going to show you some of the namespaces. What we see now is the process that's actually running — the container isn't running, but there's the crun process which initializes the namespaces — and this also lists the namespaces we have. If you had time to look at the details, you'd notice the numbers behind them are different, which means the namespaces have in fact been created, both the user namespace and the network namespace. We can also — and that's what this nsenter command is doing — enter the namespace and look at the IP configuration inside it, and we'd expect there to be just a localhost interface, because we told it not to set up any network; and that's what we see: no eth0, no other connectivity. Now is the point where I set up the network: again, the rootful0 bridge exists — that's the bridge I want to attach this container to — the path to the runtime directory, setup, "rootless" is the name, then the secret used to generate the IP address, "test" as the user, rootful0 as the network name, publish because we want to publish a port, and in this case I'm also specifying a network alias that will be in the DNS server inside the container. This is the successful output — a lot of JSON you don't need to understand, it just gives you the network and DNS configuration — and now if we re-run the nsenter command we see that the ethernet connection exists. Right, now we need to start it, obviously, because the process inside the container isn't running yet — so I run podman start rootless, and now it's running. And we still need to test it, because if we didn't test it, it's obviously broken: I figure out my own IP address and use curl to send an HTTP request to the Python web server inside the container — and it works, so our networking worked as expected. I'll skip the stopping because we're running out of time. I also have the same thing automated with systemd: I put the exact same commands into a systemd service file, and for the lines that require root privileges, systemd allows you to just put a plus sign at the beginning of the command and it will run them as root — so that's a nice trick to get all of this into a single systemd service file.
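A sketch of such a unit — the rootful-networks helper path is hypothetical (it stands in for the published script), environment details for rootless Podman under systemd are omitted, and the interesting part is the "+" prefix, which makes systemd run just those lines with full privileges while everything else runs as User=test:

    cat <<'EOF' | sudo tee /etc/systemd/system/rootless-demo.service
    [Unit]
    Description=rootless container with rootful networking (sketch)
    After=network-online.target
    Wants=network-online.target

    [Service]
    User=test
    ExecStartPre=-/usr/bin/podman rm -f rootless
    ExecStartPre=/usr/bin/podman create --name rootless --network none \
        fedora:38 python3 -m http.server 8080
    ExecStartPre=/usr/bin/podman container init rootless
    ExecStartPre=+/usr/local/bin/rootful-networks setup rootless
    ExecStart=/usr/bin/podman start --attach rootless
    ExecStopPost=+/usr/local/bin/rootful-networks teardown rootless
    ExecStopPost=-/usr/bin/podman rm -f rootless

    [Install]
    WantedBy=multi-user.target
    EOF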
I'd show you, but we've run out of time, so you'll have to go to the website where I published this — you don't have to scan the QR code now, it will be on the last slide again, so no hurry — and it also contains the rootful networks Python script that I just used to do this. So what did we achieve? We now have automatic IP configuration — I had to reimplement it, but the script does it for you where Podman otherwise would have. We can expose ports. With the systemd service file I talked about, the container is controlled by systemd, which takes care of correct startup and teardown. Podman inspect still won't know about this network, because we added the networking manually, and if we try to use systemd notifications, that also won't work in this particular configuration — systemd will see the notify but refuse it. What's next? Before Podman 4.5 introduced the working --uidmap flag, where this just works out of the box, I would have said maybe we should add a mode to Podman to drop all privileges except for network configuration; now I'm not so sure, honestly. If you're on Podman 4.5 or later, probably just use --uidmap to do the same thing. And maybe there's some improvement to be had for rootless networking in general — there's a talk tomorrow at 9:30 on rootless container networks getting in shape with pasta, if you want to give that a shot. Right, that's it — thank you for attending, and any questions?
The question was: the JSON format for netavark isn't documented, so probably not standardized — am I afraid this will break with the next netavark update? Yes, very. On the other hand, it's also an interface between two processes, Podman and netavark; I know those two are developed in unison, but I'm at least hoping they will preserve backward compatibility and that what I'm currently doing will continue to work. It also looked like somebody really wanted to document the JSON interface and just didn't, so I think lack of time is the only reason it isn't documented. The next question: I demonstrated that the setuid bit trick won't work because it maps to an unprivileged user — what about file capabilities? I actually can't tell you; I'd have to look into the kernel source code to see what it does. But the general rule I learned from Michael Kerrisk — he has a great training on all these isolation APIs, attend it if you have the chance — is that you need the permissions in the user namespace that owns what you're trying to access. If that principle holds, then it shouldn't be possible, because in the user namespace that owns the files, which should be the root namespace, you as an unprivileged user wouldn't have the capability. But I'd have to test it. Other questions? A question from the internet maybe? Then thank you, and enjoy DevConf.
Hi, so this talk is about rp_filter, as you could guess. I believe most of you know what rp_filter is, or at least have some idea of what it does. In this talk I'm going to make the scope a little larger, because I'm first going to develop what the IETF says about what I will call RPF. For the purposes of this talk I'll make a distinction between RPF, the algorithms defined by the IETF, and rp_filter, the Linux kernel implementations — and as we will see, there are several of these implementations in the kernel.
So what is RPF? Let's say we have a router in the middle, connected to different networks. The general idea of RPF is that the router, instead of just routing packets based on the destination address, also validates the source address of incoming packets. Here, for example, it receives a packet from network blue and verifies that the source IP address of the packet actually belongs to network blue and not network red. The objective is to limit the problems caused by IP address spoofing on the internet. Say we have an attacker inside network green who wants to attack a victim in blue: it sends a packet to the server in red, but instead of setting the source IP address to its own address, it uses the IP address of the victim. When the server receives the packet, it replies to the victim. There are two benefits for the attacker: first, its IP address never leaves its network, so the victim has no idea where the attack actually comes from; and second, if the server is chosen carefully, there's a nice amplification factor — the attacker sends small packets and the server replies with much bigger ones. There are protocols that are famous for this, SNMP for example. This problem has been known for a very long time, and even old RFCs like RFC 1812, requirements for IPv4 routers, define a basic algorithm for source IP address validation; unfortunately, even at that time it was clear it would break some routing topologies, so the recommendation was not to turn it on. Then came RFC 2827, more commonly referred to by its best current practice number, BCP 38 — a BCP is just a collection of RFCs. Unfortunately this RFC didn't really provide a technical solution, it just talked about using access control lists; it said source IP address validation was important, but not how to do it. The IETF continued to work on this problem, though, and we now have two more RFCs, RFC 3704 and RFC 8704; let's see what algorithms they define. The first one, RFC 3704, defines four flavors of RPF: strict RPF, loose RPF, loose RPF ignoring default routes, and feasible-path RPF. Strict RPF is the simplest and most intuitive version. Here we have a router connected to two gateways leading to two different networks. With strict RPF, when the router receives a packet from gateway blue, it looks at the source IP address: if the address belongs to network blue, it routes the packet normally, and if it belongs to another network — network red or anywhere else — it doesn't route it at all, it drops the packet. This is the original idea that we find even in the oldest RFCs. The problem is that it has always been clear it would break asymmetric routing. Here, for example, we have two nodes, A and B, that can communicate with each other through two gateways, blue and red, but node A is configured to only use gateway blue and node B to only use gateway red. When node B receives a packet from node A via gateway blue, it does the strict RPF check and realizes that it would normally route node A through gateway red, yet it received the packet from gateway blue, so it drops the packet. The same thing happens at node A: it receives packets from node B through gateway red but would route to B through gateway blue. So strict RPF has actually broken the communication between our two nodes.
Another variant of RPF that was defined is loose RPF. The loose version doesn't take the input interface into account at all: it just verifies that the source IP address of the packet is routable. In this case, when the router receives a packet from gateway blue, it does a route lookup on the source IP address, and if the source address is routable it routes the packet normally, even if the source address belongs to network red. Obviously, if the router has a default route, loose RPF is going to let almost any packet through, because as soon as it's routable, it flows. That's why there's the special variant, loose RPF ignoring default routes, where the default route can't be used for the route lookup — but it doesn't really make any difference in practice; in our case, for example, gateway blue, or an attacker in network blue, could still spoof an IP address from network red. The last variant defined in RFC 3704 is feasible-path RPF, which is interesting because it uses extra information that the other variants don't: all the information available through the dynamic routing mechanisms. Let's say we're using BGP. We have different autonomous systems; at the bottom, AS0 announces routes — here it's the same route, 2001:db8::/48, announced to AS1 and to AS2 — and eventually AS4 learns about this route. The AS path is longer on the red side, so for routing purposes AS4 doesn't use the route via AS3; but with feasible-path RPF, AS4 will still accept incoming packets with a source address in 2001:db8::/48 on that interface, even though it wouldn't use that path to send packets, because the route was also announced there. But that wasn't enough for all practical use cases, so the IETF continued working on this problem and defined a new RFC, RFC 8704, with two new variants of RPF. They're both based on the idea of feasible-path RPF, but the algorithm gets a bit more complex — even the acronym, as you can see, is getting complex. The idea is again to use the information we can collect with BGP, but instead of checking the origin of the source prefix, we check the reachability of the autonomous system that the source address belongs to. For example, this is algorithm A, the first algorithm defined in this RFC: AS0 announces one route to AS1 and a different route to AS2, and finally AS3 receives the announcements for both routes. With this enhanced feasible-path RPF, AS3 will accept packets from either of these prefixes, no matter whether they are received from AS1 or from AS2, because both prefixes belong to AS0, and AS0 is the origin AS in both AS paths. In practice there are also problems with routes that are not propagated to all autonomous systems, so the IETF defined algorithm B, which lets a network administrator work around that by putting interfaces into an interface group. Here AS0 announces a route — the same route — to AS1 and to AS2, but AS2 doesn't propagate it to AS3. This is worked around by the administrator putting both interfaces into the same interface group: now AS3 will accept packets with a source address belonging to this route even if they arrive from AS2, because that interface is in the same group as the interface towards AS1, and the packet would have been accepted if it had arrived there.
So as you can see, RPF has become something much more complex and complete than a simple route lookup on the source IP address, and implementing these algorithms needs information that isn't available to the kernel. Okay, enough with the theory — let's see how the Linux kernel implements rp_filter. As I said, we have several rp_filter implementations: one sysctl for IPv4, and then the iptables, ip6tables and nftables modules, which all have their own implementation — so that's three IPv4 implementations and two IPv6 implementations. How to configure them you can see on the slide, offline; all of these implementations support strict and loose RPF (there's a dedicated option for loose mode in the modules). And beyond the limitations we already saw in the theoretical RPF algorithms, we also face some kernel-specific problems: we have five implementations to keep synchronized, the kernel doesn't handle IPv4 and IPv6 routing tables the same way, and there are the kernel's advanced routing features. Let's look at these problems in detail, starting with how rp_filter interacts with regular routing tables. Here we have a route that is defined to use two different output interfaces. What happens with most of the rp_filter implementations is that we can receive packets from either of these interfaces; only the nftables IPv6 implementation allows just one of them. If we do the same kind of configuration with more recent commands — nexthop groups instead of a single route with several nexthops — we get exactly the same results. I'm not going to go too deep into why, but the problem lies in how rp_filter handles ECMP routes, because what we did in these two examples is create an equal-cost multipath route, and the IPv4 and IPv6 implementations handle that in different ways, which leads to different results. Now let's see what happens when we define the same route twice but with different gateways — we can do that with ip route append. This time we don't have an ECMP route, and what happens is that most implementations only accept the first route; only the ip6tables rp_filter implementation accepts packets from either interface. Again, that's tied to how the kernel handles ECMP: even though we didn't explicitly want an ECMP route, IPv6 internally converts these routes into a single ECMP route, and the rest is implementation detail we won't have time for. Now a more common use case: defining the same route with different preferences — here preference 1000 and preference 2000, using two different interfaces. Preference is like metric, whichever keyword you use, it works the same way: the favorite route is via eth0 and eth1 is just a fallback. So what is rp_filter supposed to do — accept packets on eth1, or only on eth0? The result is that almost all implementations accept packets only from eth0, but again ip6tables behaves differently; the root cause is about the same as in the previous example, because this implementation constrains the route lookup for IPv6 rp_filter. That's maybe something we can talk about later, if we have time.
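Since the slide with the knobs isn't reproduced here, this is roughly how they are switched on — the sysctl takes 1 for strict and 2 for loose, and the netfilter variants are matches or expressions you add to your own rules (the nft line assumes an inet filter table with a prerouting chain already exists):

    # IPv4 sysctl implementation: 0 = off, 1 = strict, 2 = loose
    sysctl -w net.ipv4.conf.all.rp_filter=1
    sysctl -w net.ipv4.conf.eth0.rp_filter=2

    # netfilter-based implementations: drop what fails the reverse-path check
    iptables  -t raw -A PREROUTING -m rpfilter --invert -j DROP           # IPv4, strict
    ip6tables -t raw -A PREROUTING -m rpfilter --loose --invert -j DROP   # IPv6, loose
    nft add rule inet filter prerouting fib saddr . iif oif missing drop  # nftables, strict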
These are problems that could probably be worked on in the kernel, at least to make the different implementations behave similarly. But there are also problems that are more fundamental and don't only depend on how the code is written. Let's talk about policy routing. Policy routing, for those who don't know, is when we use different routing tables filled with different routes and jump from one table to another depending on some particular pattern. Here, for example, the main routing table uses eth0 and table 100 uses eth1, and we decide whether to do the lookup in the main table or in table 100 depending on the destination port: if the destination port is 50000 we jump to table 100, otherwise we stay in the main routing table. Let's see what happens. In this case, only the IPv4 rp_filter sysctl jumps to table 100: it will accept packets from eth1 if the source port is 50000. So rp_filter simply swapped the source and destination ports for its reverse lookup, which probably feels intuitive, because that is effectively what happens for the IP addresses — but we'll see that it's not always the right solution. All the other implementations simply don't take the port into account, so they never jump to table 100. Jumping to table 100 when the source port is 50000 might look like a good idea, because when you send a packet to a destination port, the answer normally comes back with the source and destination ports swapped — but that's not always the case, especially with UDP tunnels like VXLAN or Geneve, where the same destination port is always used; in that case this behavior is not appropriate. Now let's do policy routing on something different — a packet field that can't be swapped between the request and the response: the DS field. The DS field is kind of like the TOS for IPv4 or the traffic class for IPv6; those should be obsolete and replaced by the DS field, but you get the idea of what it is. Let's see whether there are differences between our rp_filter implementations here. Actually, all implementations work more or less the same way — I'll skip some details, but for this talk we can consider that they behave the same. If the DS field on the return path matches the IP rule, the reverse lookup is done in table 100. Again, that probably looks intuitive, but in reality the DSCP doesn't have to be mirrored on the return path, so if the returning packets don't carry the same DSCP value as the outgoing packets, we don't jump to the same table and rp_filter breaks connectivity. We can also do policy routing based not on IP packet fields but on packet metadata — for example the packet mark, the socket user ID, the input interface and more, but we're only going to consider these examples. Packet mark: all of the rp_filter implementations ignore the packet mark when doing their reverse lookup, because you would need the same packet mark on output and on input to jump to the right table, and it's often difficult enough to keep the packet mark symmetric between input and output — you have to use special options to make rp_filter, whatever the implementation, respect the mark.
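Going back to the destination-port example above, the setup being tested is roughly this (addresses and interfaces are invented; sport/dport selectors for ip rule need a reasonably recent kernel and iproute2):

    # main table goes out via eth0, table 100 via eth1
    ip route add default via 192.0.2.1    dev eth0
    ip route add default via 198.51.100.1 dev eth1 table 100
    # packets *to* port 50000 should use table 100
    ip rule add dport 50000 lookup 100

    # with net.ipv4.conf.all.rp_filter=1, only the sysctl implementation mirrors the
    # rule for its reverse lookup (it swaps the ports, so replies arriving on eth1
    # with *source* port 50000 pass); the netfilter-based implementations ignore the
    # port entirely and never consult table 100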
We also have problems with the socket user ID. You can jump to a different routing table based on the user ID of the socket that sends the packet — that works for locally generated packets, but on the return path we of course don't receive packets from a socket, we receive them directly from a network interface, and we don't even know yet which socket they will be delivered to. So all the rp_filter implementations do the reverse lookup with user ID 0, as if the packet had been sent by root. And for the input interface we have the same kind of problem: if we do policy routing based on the input interface, then when we forward a packet we know which interface we received it on, but on the return path, how should the reverse lookup work? If we want to reverse input and output interfaces, the "input interface" should be the output interface we would use — but that isn't always known either; it depends on where in the netfilter hooks the rule sits, so it depends not only on the implementation but also on where the rule is inserted. To summarize the problem with policy routing: we have different sets of metadata available on the transmit and receive paths, we have packet fields that may differ between transmit and receive, and even for something as obvious as source and destination ports, where you'd say "just swap them", that doesn't work all the time. There is really a fundamental asymmetry between the receive and transmit paths, which makes policy routing mostly incompatible with rp_filter. To summarize the whole talk: RPF, even if we only consider the theoretical part, the IETF work, is not a simple on/off feature — we have to select which flavor of the algorithm we want to use. For the advanced algorithms we need cooperation with dynamic routing daemons, and, if you remember algorithm B, possibly some special configuration from the administrator. We also have to keep in mind that RPF was designed for routers — especially for ISPs — not for end hosts. Then we have the implementations: five different rp_filter implementations that are hard to keep synchronized, plus all the complexity of how the kernel manages route lookups, which has side effects on the implementations. There are things we could improve, like the handling of equal-cost multipath, to make all implementations behave similarly, but there are also fundamental incompatibilities with advanced routing — in particular, in many cases where IP rules are used — and we don't, and can't really, currently support the advanced RPF algorithms, because they need cooperation with a BGP daemon. Okay, thank you — time for questions. The question is whether we can automatically detect if some IP rules are going to create problems when we use rp_filter. We could at least detect that there are special IP rules and consider that a potential problem for rp_filter, but the biggest question is what to do about it: if we activate rp_filter and detect that there's going to be a problem, do we just disable rp_filter entirely? For some of these IP rules there is simply no correct way to handle the problem automatically — we need some help from the administrator. The next question is whether this is primarily a means to manage the problem of IP spoofing on the internet. Yes — from the IETF's point of view, that is the reason RPF was designed.
That's the main reason, really. What the IETF recommends is that you activate RPF as close to the customer as possible, if possible even on the first router; and if you have a very big broadcast network behind that, well, you probably also have other security problems with such a big network segment. On Linux, you can activate it on a router or on an end host — the rp_filter implementations work the same either way — it just makes less sense on an end host, because there it isn't going to change anything with respect to IP spoofing. The last question is whether we should disable it by default on RHEL. I know this question was already answered some years ago, and the answer was no, we shouldn't disable it by default, because it's security and we don't disable security. There's no real technical argument; it's just the fear of accepting packets that were not accepted before. So yeah, that's the reason. No more questions? Thank you.