Hello everyone, my name is Leon Romanovsky and I'm coming from Mellanox. In this talk I will present how Mellanox provided a development environment for Linux kernel developers to do their development and testing inside Docker containers.

This project was started by two people, me and Jason Gunthorpe, both of us closely associated with the RDMA subsystem. Jason Gunthorpe is the upstream maintainer of the RDMA subsystem; he is responsible for sending pull requests to Linus and for reviewing the mailing list. I'm the Mellanox RDMA maintainer, which means that everything that goes out of Mellanox towards the RDMA subsystem, both kernel and user space, passes through me. So we do a lot of reviewing, but being a maintainer is not only about review, it is also about development. Just to show the amount of patches we both handle, I added statistics from 2018: on average, each of us developed and reviewed, counting only what was publicly visible, approximately two or three patches every day, and the amount of patches tested and reviewed in-house is somewhat bigger. We wanted to keep doing that amount of work without any trouble and without any delay.

So at the beginning we started to think about what the perfect solution would look like from our point of view. First, developers should not need to be aware of the differences between operating systems. I don't mean that you can develop Linux without knowing anything about the operating system; I mean that moving from one distribution to another can be an adventure. For example, Fedora 30 by default doesn't bring up network interfaces, while Fedora 29 worked perfectly. Fedora 26 worked perfectly with the serial console, but Fedora 27 shipped a broken serial console. We don't expect our developers to be aware of such subtle things, and we wanted to remove that burden from them.

Another important thing from our perspective is that we wanted to bring the latest development environment: the latest GCC, the latest Clang, the latest smatch, the latest sparse. From what we saw, developers usually install some operating system on their computer, whether it is a laptop or a server, and from that point they never upgrade anything. We actually saw people who started with RHEL 5, RHEL 6 or RHEL 7 and stayed there, which may be fine for them, but it is not good for upstream development at all, because the old GCC in those distributions doesn't catch a lot of errors that a simple compilation on a modern distribution would catch.

It also has to work on every machine. When we are talking about scale and about many developers, those developers run a lot of different operating systems: it can be Red Hat, Ubuntu, Fedora, and we don't want to limit anyone in their choice of operating system. And because kernel developers basically spend their day in a loop of writing code, building code and testing code, that loop has to be as fast as possible. In particular, we don't want to have to install our kernel just to test it.
So all solutions that require a virtual machine - boot the virtual machine, compile inside or outside it, run make modules_install and make install inside it - don't work fast, because they actually slow down our development. Our perfect solution should work without any installation, directly from source, because in the end all development revolves around source and patches, and we don't want to change anything there.

One extra thing: because we are talking about the ability to run anywhere, and me and Jason travel a lot, we need the ability to run our CI without any internet, without any connection to the outside world. So we brought the CI into the same system. I forgot to mention that everything I'm going to present here is available in our repository, open source; everyone can see it, everyone can change it, it's not something we are trying to hide. Having CI also means that we need to provide a fast build: no one wants to compile the kernel and wait minutes and hours for the results. So we need a fast implementation of the build process, one that takes into account that we can use ccache, and that properly understands the number of CPUs - believe it or not, in real life, when you are jumping between servers, you are not always aware how many CPUs a specific server has, and if you miss that number you won't fully utilize the server.

So, is it possible or is it impossible? In an ideal world, the developer writes code only once, builds only once, tests only once, and everything works perfectly. At least for me it doesn't work like that: I always make mistakes, my code never compiles at version 0, it never works at version 0 either, and usually it doesn't work at version 100. So I need this loop to be very fast.

We tried to look at different existing solutions. Everyone who is learning Docker and learning the kernel creates their own Docker container to build the kernel, but most of them don't use anything smart like ccache. Still, let's put that aside: we have plenty of Docker containers that compile the kernel. We have far fewer that run QEMU inside Docker. One of the best-known QEMU runners is called Birkney. It doesn't run QEMU inside a container, which means that once you run it at scale you will hit different problematic behavior with different QEMU versions on different operating systems. It also has a very important limitation, at least for us: it uses a BusyBox environment to run everything. But our customers run real operating systems, which means our verification runs real operating systems, which means that we as developers should also run a real operating system. We cannot allow ourselves to run a BusyBox image disconnected from the rest of the company; it wouldn't allow us to reproduce bugs against the actual deployments. And because we come from the RDMA department, it is also extremely hard to put the RDMA stack inside a BusyBox image. Another runner which is also well known is called docker-qemu, and it is almost perfect except for one thing: it runs QEMU with a virtual machine image, so you need to create a VM image, and every time you update the kernel you may need to recreate or reinstall it, which defeats the whole point of the solution.
And there are definitely plenty of options to run a precompiled kernel, but no holistic solution that takes you from source all the way to a running kernel. So, as I said, everything is available on GitHub. I need to warn you that our main focus was to serve Mellanox developers, so this repository contains some Mellanox-specific code. It is not a lot of code: it is only the place where we store precompiled container images and how we grab sources from our internal repositories. Everything else is more or less general and common.

The architecture itself is pretty simple, because to make a simple solution you actually need to build something simple. It is based on three layers. The first layer is whatever you want; we call it the hypervisor. It is your already existing server where your source code is already stored, which means you don't need to change anything: your code stays on that server and you run whatever editor works for you. Build artifacts and logs are also stored on this hypervisor.

The second layer is a simple Python script that tries to hide the complexity of running Docker and passing all the different parameters to it. It has an images command to create the images from Dockerfiles in case you don't have access to precompiled images (Mellanox always has access to precompiled images): you can create all these images yourself by running a single command, mkt images, and it magically builds all the images locally. This is why I said the Mellanox-specific code is not a big deal. It has a run command to actually run QEMU on your machine, a ci command to check your patches - and when I say CI, I mean static analysis - and a build command for a smart build of your project. We are not only interested in building the kernel: we also develop QEMU, and we do a lot of work on iproute2 and on rdma-core, the supplementary part of RDMA - the RDMA subsystem has its own kernel part and a user-space part called rdma-core, and it is not possible to work without the corresponding part.

And the third layer is the containers themselves. In the end we do only two operations, compile or run, so we need only two containers: one compile container - CI and build use the same compilation container - and one run container.

Installation itself is pretty simple. All you need is to go to the repository, download it and put it on your PATH. Once it is on your PATH, you will have a number of simple commands. The first command is called setup, and setup does a few Mellanox-specific things which you may or may not need - you probably won't need them. It installs Docker CE, because we cannot rely on the Docker shipped inside the operating system; we found the distribution Docker to be pretty unusable. It fetches the sources, and it updates the machine to the latest packages - that was just my personal decision, to force updates on everyone. And the last thing setup does is place a configuration file into your home directory.

Also, because we are working with NICs, we sometimes have different setups and more than one machine; we have a number of machines, but we don't want to run the compilation on every machine. So we provide a master/slave setup option which creates an NFS share on the master, read-only for all the slaves, and connects the slaves to it. This allows you as a developer to develop on one machine only, without having to copy any artifacts around.
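To give a feel for the day-to-day flow, this is roughly what it looks like; treat the exact command syntax as a sketch, it may differ from what is in the repository today:

    # one-time preparation on the server ("hypervisor")
    mkt setup        # install Docker CE, fetch sources, place the config file
    mkt images       # build the container images locally if you have no registry access

    # daily loop, run from the source tree on the hypervisor
    mkt build        # smart build of the current project (kernel, iproute2, rdma-core, ...)
    mkt ci           # static analysis (sparse, smatch, clang) of your patches
    mkt run          # boot the freshly built kernel in QEMU inside a container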
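The master/slave sharing is plain NFS underneath; a minimal sketch, assuming the artifacts live under a hypothetical /images directory:

    # on the master, where the compilation happens
    echo '/images  *(ro,no_subtree_check)' >> /etc/exports
    exportfs -ra

    # on each slave, which only boots and tests the result
    mount -t nfs master:/images /images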
Because we are using a very small and very controlled kernel, it works pretty well over NFS too.

So, as I said, we have a container for compilation. We need it for CI, we need it for build, and we need it for a third thing, which is called support applications. Sometimes you want to bring in something bleeding edge and you don't want to rely on a specific Fedora repository. For example, here I am bringing smatch from a specific git repository: I write a shell script - a pretty simple shell script - which packages smatch and creates an RPM. It pulls from a specific repository at a specific tag, applies specific patches on top, and creates an image with that specific version of the support package. So we bring the upstream version of smatch, and once every week or two I update this file to bring in the latest and greatest smatch and sparse. We also bring the latest Clang, because Clang 9 is not available in Fedora yet, and so on.

The build code is actually pretty easy. First of all, you need to configure ccache properly. Unfortunately, most of our developers either didn't know about ccache or didn't configure it properly. ccache by itself brings a lot of performance gain for recompilation, and because the same ccache is used by the CI, it also gives you pretty fast performance both for CI and for the build itself. The build scripts properly understand which project you are working on: you can cd into your iproute2 source directory, run mkt build, and it will build iproute2 with the proper flags, ready to be used inside the container.

The thing that actually makes compilation fast is a small config file. Because mkt is a full solution from beginning to end, we have full control over the configuration file and over all the kernel components we will use. So we set everything to "no" and enable only virtio, virtio-9p and virtio-pci, because we rely very heavily on virtio-pci; we enable debug options for kernel debugging; and in this case we enable the Mellanox-specific drivers to be able to run Mellanox NICs - and that's all. This is what provides the most visible gain in compilation time. This config is so strict that I don't think you could use it in a regular VM; it only works in such a controlled environment, where we actually control everything. This is an example of me running mkt build on my kernel, and it takes a pretty small amount of time.

For CI testing, which uses the same build container, we run the build with extra warnings enabled and run smatch, sparse and Clang. Once you run that on your kernel, it will produce a lot of noise, a lot of warnings. For example, on our subsystem smatch and sparse produce something like 600 warnings or errors, and while I am not saying they are all false positives, it is hard to distinguish what is important; you need to differentiate between what is relevant and what is not. A very naive solution would be to run everything twice: before the patch, store the output, run again and diff. But that is very naive: it makes it hard to understand which line introduced a warning, because all the line numbers shift, and it also doubles the compilation time.
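To make the two build-speed points concrete - the minimal config and ccache - here is a rough sketch. The option names come from mainline Kconfig, but this is not the actual mkt config fragment: a real one also has to enable the dependencies of these symbols (PCI, NETDEVICES, INFINIBAND and so on), and the ccache path is hypothetical:

    # start from (almost) everything off and switch on only what the guest needs
    make tinyconfig
    ./scripts/config -e VIRTIO -e VIRTIO_PCI -e VIRTIO_NET \
                     -e NET_9P -e NET_9P_VIRTIO -e 9P_FS \
                     -e DEBUG_KERNEL -e KASAN \
                     -e MLX5_CORE -e MLX5_INFINIBAND
    make olddefconfig                  # let kconfig resolve what it can

    # share one ccache between build and CI, and use every CPU the server really has
    export CCACHE_DIR=/images/ccache   # hypothetical location on the hypervisor
    make -j"$(nproc)" CC="ccache gcc"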
So our solution was quite different. We run everything only once, and after that we filter all the messages and run git blame to understand whether each message actually belongs to a line that the patch under test touched. It is not bulletproof, but at least so far it has avoided any report from the 0-day kernel build bot about something we missed, so from our perspective it works pretty well - though we are aware of the possibility of missing errors. An extra thing that also lets us run CI very efficiently: we look at the diffstat of the specific patch, see which folder the patch touches, and build only that folder; kernel make provides an option to build just one folder, so you don't need to build everything.

The run flow is also pretty straightforward. We want something simple for our developers, something you run without any extra parameters and get decent default behavior. As I said before, we don't want people to manage a VM, so inside QEMU we actually create links for the kernel modules that point back to the hypervisor, back to the location of your built modules, instead of installing those modules. Here is an example of some of the modules linked back to my hypervisor, and this is how it looks inside QEMU, inside the virtual machine. We can skip this one; it is a simple configuration file.

So how do we do it? The main thing - and this is something I did not find in any GitHub repository I looked at before - is that we rely on the fact that we are running QEMU inside Docker, and once you are running Docker you already have a proper root filesystem image, because the container is based on something. It means we don't need to create an extra VM image for QEMU; we only need to instruct QEMU to use the image from the Docker container. So, as a preparation step for executing QEMU inside Docker, the container starts and enters its entry point, which is a fairly large Python script. The first thing it does is mount the root filesystem. Then it creates systemd unit files - because the root is already mounted, we can create, on that mount point, systemd units that mount the folders from the hypervisor which were passed by the runner to the container; those systemd units will be picked up later inside QEMU to mount them in the guest. We configure QEMU to work in pass-through filesystem mode - there is something here which, I don't know if it is common knowledge or not, took me a while to find: you can actually configure QEMU to use a pass-through filesystem - and we boot QEMU. Because we configured the pass-through, everything goes back to the Docker container. So in the end it comes down to just a couple of commands: one is the mount, and the other is the specific QEMU parameters that avoid the creation of a VM image. In this way we avoided creating a VM image and installing the kernel into it altogether.

The networking is a bit more complicated, because we have two types of interfaces. The first type is the management interface, the one you connect to with SSH or use to go outside your network, and here comes something specific: we rely on the fact that on your machine br0 (bridge 0) is already configured as the interface to the outside world. If it is not configured, you will get a local address, a localhost address, which will still allow you to connect with SSH, but it is actually pretty easy to configure br0.
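Going back to the CI filtering for a moment: the real implementation is part of the Python CI code and is more careful, but the idea can be sketched in a few lines of shell, assuming a single patch at HEAD and the usual "file:line:" warning format:

    # build (and run sparse on) only the directories the patch touches; kbuild accepts "make <dir>/"
    for d in $(git diff --name-only HEAD~1..HEAD | xargs -rn1 dirname | sort -u); do
        make -j"$(nproc)" C=2 W=1 "$d/" 2>> warnings.log
    done

    # keep only warnings pointing at lines last modified by the patch under test
    grep -E '^[^ :]+:[0-9]+:' warnings.log | while IFS=: read -r file line rest; do
        commit=$(git blame -L "$line,$line" --porcelain HEAD -- "$file" | head -1 | cut -d' ' -f1)
        [ "$commit" = "$(git rev-parse HEAD)" ] && echo "$file:$line:$rest"
    done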
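And the QEMU trick itself, stripped to its essence; the paths, memory and CPU numbers here are made up, and the real entry point also generates the systemd units for the extra hypervisor mounts:

    # inside the container: expose the container's own root to QEMU over virtfs/9p
    qemu-system-x86_64 -enable-kvm -m 2G -smp 4 -nographic \
        -virtfs local,path=/,mount_tag=/dev/root,security_model=passthrough,id=rootfs \
        -kernel /kernel/arch/x86/boot/bzImage \
        -append 'root=/dev/root rootfstype=9p rootflags=trans=virtio,version=9p2000.L rw console=ttyS0'

    # no disk image exists anywhere: the guest mounts the same tree the container sees,
    # so /lib/modules can simply be a symlink back to the build directory on the hypervisor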
And the second type of interfaces are the ones we are actually testing. Because we are a networking company, there is a catch in testing interfaces inside one kernel. If you have a number of PCI devices that are connected to each other with an external link, the kernel knows that these devices live in the same kernel, so it will not send packets out over the link; it will reroute all packets internally. But this is not what we want to test. We want to test the actual devices, and we want to be sure that all our packets go out over the external link and come back through the external link, despite the fact that both devices sit on one kernel.

The solution is both simple and complicated, and there are a number of options. One of the common solutions is to use network namespaces, but because we come from the RDMA subsystem, unfortunately namespaces don't work well with RDMA traffic, and namespaces also add the complexity of executing every command inside a namespace - you need to run ip netns exec plus the namespace name - which complicates the whole development and testing flow. So routing tables are the way to go. Here I simply posted how it is supposed to be done, just as a reference, for two NICs: first of all, you need to disable things like reverse-path filtering and adjust the ARP behavior; then you need to configure routing tables that force the traffic through the different interfaces, change the rule priorities, and flush the route cache. After that, all your pings will actually send packets out over the wire. This code is not yet part of the repository, because I want to add something generic, not only for two devices.

Hardware support is pretty easy; it is actually a very simple thing. Everything inside our images is based on virtio-pci, so all we need is to unbind the real device from its driver on the hypervisor, bind it to VFIO, and hand it to QEMU in pass-through mode.

So, as I said, everything is available online, and we would really like this tool to become more generic than it is now. Simply join us, help us, and we will help you back. Any questions? Okay. Thank you.
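For reference, the routing-table setup described above looks roughly like this for two ports; the interface names and addresses are hypothetical and the real version has to be generated per test setup:

    # eth1 (192.168.100.1) and eth2 (192.168.100.2) are two ports of the same host,
    # connected to each other with a physical cable

    # accept locally-sourced packets arriving from the wire, relax reverse-path filtering
    sysctl -w net.ipv4.conf.all.accept_local=1
    sysctl -w net.ipv4.conf.all.rp_filter=0
    sysctl -w net.ipv4.conf.eth1.rp_filter=0
    sysctl -w net.ipv4.conf.eth2.rp_filter=0

    # make each port answer ARP only for its own address
    sysctl -w net.ipv4.conf.all.arp_ignore=1
    sysctl -w net.ipv4.conf.all.arp_announce=2

    # per-destination tables that push the traffic onto the wire instead of loopback
    ip route add 192.168.100.2/32 dev eth1 src 192.168.100.1 table 101
    ip route add 192.168.100.1/32 dev eth2 src 192.168.100.2 table 102
    ip rule add to 192.168.100.2 lookup 101 priority 50
    ip rule add to 192.168.100.1 lookup 102 priority 51

    # lower the priority of the kernel's "local" table so the rules above win, then flush
    ip rule add from all lookup local priority 100
    ip rule del priority 0
    ip route flush cache

    # this ping now really leaves the box through eth1 and comes back in through eth2
    ping 192.168.100.2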
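And the hardware pass-through step, again as a sketch with a hypothetical PCI address:

    DEV=0000:82:00.0                 # PCI address of the NIC to hand over to the guest
    modprobe vfio-pci

    # detach the device from its native driver on the hypervisor
    echo "$DEV" > /sys/bus/pci/devices/$DEV/driver/unbind

    # tell vfio-pci to claim this vendor/device ID
    echo "$(lspci -ns "$DEV" | awk '{print $3}' | tr ':' ' ')" \
        > /sys/bus/pci/drivers/vfio-pci/new_id

    # then add "-device vfio-pci,host=$DEV" to the QEMU command line shown earlier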