Hello everybody, so I guess we can start. Thanks for coming. I will talk about Traceloop and Inspector Gadget. First, my name is Alban, I work at Kinvolk as CTO and one of the co-founders, working on a lot of Kubernetes and BPF things. So Traceloop and Inspector Gadget are two open source projects on GitHub. Traceloop is a tracing system somewhat similar to strace, but it works in a different way: it uses cgroups, BPF and a ring buffer, which I will talk about later. Inspector Gadget is not a single-node tool but works at the Kubernetes level. It is a collection of gadgets of different kinds to trace different aspects, like the opening of new files or the execution of new processes, and so on. It is a collection of gadgets for Kubernetes developers, to help them debug their applications.

I will talk a bit about the context of why these two exist. At Kinvolk we have a stack with a distribution called Flatcar Linux, which is an OS optimized for containers. On top of that we have Lokomotive, which is a Kubernetes distribution. Tools like Inspector Gadget sit on top of that and offer higher-level tooling for Kubernetes developers. All of this is open source. In Flatcar we have release channels: the traditional alpha, beta and stable. When something is new it goes into alpha, after a while it is promoted to beta, and then to stable. But recently we added a new release channel called Edge. It is meant for experimentation, so there is no guarantee, but we can use it to experiment with new BPF features, new cgroup features, or anything else we want to try. So that is the context we are working in. That said, Inspector Gadget only relies on normal Linux features, so it does not have to run on Flatcar or Lokomotive; it can also be adapted to another Linux system. And Inspector Gadget runs on Kubernetes, but it does not require a particular Kubernetes distribution. Still, at the moment it is much easier to get started with Lokomotive on the Edge channel, because it needs recent features in different places.

Before going into more detail on these two tools, I will give an introduction to other BPF tools, so that we can better situate Inspector Gadget and see how it relates to existing work. First of all, BCC. Let me just check how many people here have used BCC. Okay, that is pretty much everybody, that is good. So you already know that it is a collection of BPF programs that serves two purposes: it is a set of tools that are really useful on your Linux system, and it is also useful for learning BPF because it gives a lot of examples. It shows you how you can do things in different areas, in the kernel, in user space, and so on. I will quickly show you the website, if I can, just so you can see the list of tools that are there. There are a lot of examples. This is what you saw before: there are many tools where you can read the BPF source code as an example of how to do things.
So you can see how it works. But this is all on a single node: these are generally command line tools. They can run in a container, and that is what I am showing here, a Docker container with the BCC tools inside.

There is another tool, separate from and independent of BCC, called bpftrace. With bpftrace you can write a lot of things as one-liners. It does not use C to write your BPF code the way BCC does; it is a new language inspired by awk and C. I will show you the website; there are a lot of examples. Here you can see some one-liners. In a single line of code you can attach to a tracepoint, for example on process execution, without having to write the large amount of C code you would otherwise need. There are other examples with other one-liners, and many tools are built on top of bpftrace. So it is another way to write BPF programs that are useful for tracing. Here is the same kind of screenshot.

Both the bpftrace tools and the BCC tools are single-node tools. But they would also be useful at the Kubernetes level. So what do we have to do to adapt these tools to Kubernetes? First, the granularity of the tracing has to be different. In Kubernetes we care about pods; we do not care about tracing this process or that PID. We are at a higher level and we want to trace a pod. Ideally we want an interface where we say "trace this pod" and the system does whatever is needed: it finds the node where that pod is running, so developers do not have to care about nodes or always check which node the pod is scheduled on, and we keep using the traditional Kubernetes interface, kubectl.

There is a project that does this called kubectl-trace. What it does is take bpftrace programs and schedule them on a node of your Kubernetes cluster. So it addresses a lot of the goals I listed before, but there are a couple of things that are a bit lacking, which I will show you afterwards. kubectl-trace works like this: on your laptop, you type the kubectl trace command; it is a plugin for the kubectl CLI tool. When you type that, it executes kubectl-trace, and from there it only talks to the Kubernetes API server. It does not SSH into a node; it uses the normal Kubernetes way of creating pods, config maps, configuration and so on. So when you ask to trace something, the API server will start a pod on the node, called trace-runner, which executes your bpftrace program and installs the BPF program in the Linux kernel. It can then collect the logs from that and return them to the user.

So there are a couple of slightly weak points in kubectl-trace at the moment. It is not so easy to filter the tracing on a specific pod, because bpftrace was designed more for running on a single system where you can filter on a PID.
But at the Kubernetes level we do not have a user interface where the user can check which PID belongs to what, so there are a couple of issues with filtering on a pod. Anyway, there is a demo I tried a few months ago where you can filter with kubectl-trace, with the bpftrace program here on the command line, and filter on a specific pod by saying I want to trace everything in this cgroup. It is not so practical, because you have to specify the ID of that cgroup, so you first have to find out which cgroup you need to filter on, and that is not a perfect interface.

So, so far I talked about BCC, which works on a single node, about bpftrace, and about kubectl-trace, which lets you run bpftrace at the Kubernetes level. The next part is what we can do with BCC to run it at the Kubernetes level, and as you probably guessed, that is what Inspector Gadget is about. Inspector Gadget has a structure similar to kubectl-trace, and it is based on BCC.

So I will show you a demo here. Can you see what is on the screen, or is it too dark? Okay, is this better? Okay, sorry about that. So here I am showing you a pod that runs the nice command repetitively. If I check the logs of that, you see it does not work, because it does not have the right capability; it is not configured correctly. So now I will show you a tool in Inspector Gadget that could help to debug that. Inspector Gadget can be called on the command line like this, where you have a list of gadgets. One of them is called capabilities. What it does is that you select a pod and it shows you in real time which capabilities are requested by that pod. There are different ways to select the pod: you can give the pod name, or you can select by label. For now I will just give the name and hopefully it works. It requires a lot of capabilities here... oh sorry, I did not set the pod name correctly, so it was showing all the processes instead of filtering correctly on the pod I asked for. So now, as you see, it requests only CAP_SYS_NICE, which is what you expect when you run the command nice. And it does that every 5 seconds, because in the script I had a sleep 5.

I have other gadgets to show as well. For example execsnoop: it shows you, every time a new process is created, what the command of that process is. And I can filter by label, so I can say I want all the pods with the label, say, myapp equals something. Now it should only show the processes in the pod with that label, and every time this pod creates a new process, it is displayed here. So that is it for execsnoop. There are a few other gadgets; some of them are not implemented yet, some of them are more or less complete. But there are gadgets to trace network TCP connections, or every time a file is opened, and so on. All of this is not created from scratch: what I did is just take the BCC scripts that already exist, with small modifications to adapt them, and then Inspector Gadget runs them. So all of this so far relies on what is already written in BCC.
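To give an idea of what such a BCC-based gadget looks like underneath, here is a minimal BCC-style sketch in the spirit of the capabilities gadget. It is not the actual Inspector Gadget source; the struct and map names are invented for the example. It follows the same approach as BCC's capable tool: a kprobe on cap_capable() reports which capability each task asks for, and user space reads the events from a perf buffer. The Kubernetes-specific part, filtering by pod, is what comes next.

```c
// Minimal BCC-style sketch in the spirit of the "capabilities" gadget
// (not the actual Inspector Gadget code; names are illustrative).
// Like BCC's capable tool, it hooks cap_capable() and reports which
// capability each task requests; user space polls the perf buffer.
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct cap_event {
    u32 pid;
    int cap;                     // requested capability, e.g. 23 == CAP_SYS_NICE
    char comm[TASK_COMM_LEN];    // process name, for display only
};

BPF_PERF_OUTPUT(events);         // stream of cap_event structs to user space

// cap_capable(cred, targ_ns, cap, ...): we only need the third argument,
// so the remaining ones are left undeclared.
int kprobe__cap_capable(struct pt_regs *ctx, const void *cred,
                        const void *targ_ns, int cap)
{
    struct cap_event ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.cap = cap;
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
```

On top of something like this, what Inspector Gadget has to add is the pod selection, which is the first challenge I want to talk about.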
So what are the challenges of writing Inspector Gadget? The first one is how to select the pod. Because when you install a BPF program with kprobes, that is something installed in the Linux kernel, and the Linux kernel does not care about containers, it only sees a lot of processes. So the BPF program will run for all the processes in all the containers. There are a few BPF helper functions that can be useful for selecting or filtering what you want to see. You can get the name of the process, but that is not really useful, because the name of a process can be changed and it does not tell you which container it is. There is the PID, but we want to trace several PIDs, not only one. What is more useful for me is filtering on a cgroup ID, because every container normally runs in a different cgroup, so that is a good way to filter on the pod. For this I need Linux 4.18, so a fairly recent kernel.

Another thing: here I talk about the cgroup ID, but there are actually several cgroup subsystems on Linux. There is cgroup version 1, with several subsystems, and cgroup version 2. This BPF helper function only returns the cgroup version 2 ID. So, how it works with systemd: you can have a directory like this with the cgroup version 2 hierarchy, and on that directory you can use the system call name_to_handle_at, which returns the... sorry, the signal is gone, I will try to replug it, okay, it is back, cool. So this system call, name_to_handle_at, returns the cgroup ID. The problem is that Kubernetes normally only uses cgroup version 1 so far, and here I need cgroup version 2. So what can be done? systemd helps us a lot here, because you can configure it to enable both cgroup version 1 and cgroup version 2, and then all the other components that create cgroups need to be configured to let systemd set up the cgroups and not create cgroups by themselves. So there is a lot of configuration, but in Flatcar Linux on the Edge channel it is already done for you.

Then, in the BPF program, you need to filter on a specific cgroup ID. So here I have some pseudo BPF code; it is a bit more complex than that in reality, but it is just to illustrate what happens when you ask on the command line to filter on one specific label. What I have is BPF maps, which are a kind of global variable, and they are pre-filled with the Kubernetes labels for all the pods that exist on the system. The BPF program will look up the cgroup ID, look up which Kubernetes labels that cgroup has, and filter on them. At the moment that is a proof of concept that works like that, with some string manipulation. It is back. There is work to be done to make it more efficient, working with IDs instead of strings. How do we prepare these BPF maps with all the Kubernetes labels? What I do is use OCI hooks: every time a container is started, Kubernetes will start runc by default, and runc runs the OCI hooks. I wrote a script there so that every time a container is started, it fetches the Kubernetes labels and fills those BPF maps.

So if you want to try Inspector Gadget, it is easier to do it with Flatcar Linux and Lokomotive so far, and you can look at the instructions for installing that. You can also do it yourself on another system; it is a bit more difficult, but you can get help from the instructions on GitHub, which explain what you need to do.
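Here is a rough BCC-style sketch of that kind of label-based filtering, as I understand it from the description above. It is not the actual Inspector Gadget code: the map names, the fixed label length and the example tracepoint are all made up for illustration. The assumption is that an OCI hook fills a map from cgroup ID to a label string when each container starts, and that the CLI writes the label the user asked for into a one-element array.

```c
// Simplified sketch (not the real Inspector Gadget source): filter events by
// Kubernetes label, using the cgroup-v2 ID as the key. An OCI hook on the
// host is assumed to fill `labels_by_cgroup` when each container starts, and
// user space fills `wanted_label` from the command line.
#include <uapi/linux/ptrace.h>

#define LABEL_LEN 64   // arbitrary fixed size for the proof of concept

struct label {
    char value[LABEL_LEN];       // e.g. "myapp=demo"
};

BPF_HASH(labels_by_cgroup, u64, struct label);   // filled by the OCI hook
BPF_ARRAY(wanted_label, struct label, 1);        // filled by the CLI

static inline int cgroup_matches(void)
{
    u64 cgid = bpf_get_current_cgroup_id();      // Linux >= 4.18, cgroup v2 only
    struct label *have = labels_by_cgroup.lookup(&cgid);
    int zero = 0;
    struct label *want = wanted_label.lookup(&zero);

    if (!have || !want)
        return 0;

    // Naive byte-by-byte comparison; the talk mentions replacing this
    // string manipulation with numeric IDs later.
#pragma unroll
    for (int i = 0; i < LABEL_LEN; i++) {
        if (have->value[i] != want->value[i])
            return 0;
        if (have->value[i] == '\0')
            break;
    }
    return 1;
}

// Example use in a tracepoint: only report execve() from the selected pods.
TRACEPOINT_PROBE(syscalls, sys_enter_execve)
{
    if (!cgroup_matches())
        return 0;
    bpf_trace_printk("execve in a matching pod\n");
    return 0;
}
```

The useful property is that the only thing the BPF side needs at runtime is bpf_get_current_cgroup_id(), which is exactly why cgroup v2 and Linux 4.18 are required; user space can obtain the same ID for a given cgroup directory with name_to_handle_at.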
Okay, so now I come to a more interesting gadget in Inspector Gadget, which is called traceloop. The use case is this: as a developer, I like to use strace to find out which syscalls are executed when I want to debug an application. But strace is a bit problematic: it is slow, so you cannot just enable strace for every process on your whole cluster, that will not work. And sometimes I want to debug something that has just crashed, and it is too late to attach strace, so I cannot really add strace retroactively if the program just crashed and it does not crash reliably. So the idea is to have a flight recorder system that records all the syscalls, all the time, for all your pods, but in an efficient way, without the disadvantages of strace that prevent enabling it everywhere. That is what traceloop does with BPF. Everything is recorded all the time, but it is not saved anywhere; it is only kept in memory in a ring buffer, and only when needed do I fetch that information.

So if I compare strace and traceloop, there are several differences. strace captures the syscalls using ptrace on Linux; traceloop uses BPF instead. With strace you have to specify which process you want to trace, so the granularity is the process; traceloop instead works at the cgroup level, so all the processes in that cgroup will be traced. strace is slow and traceloop is fast, and that is because strace is reliable: it works in a synchronous way, you get the events as they happen, and it cannot lose events. In comparison, traceloop works in an asynchronous way: it writes events into a ring buffer, so if the ring buffer is full, events will be overwritten, and theoretically it is possible to lose events this way. But it still makes it very useful for debugging.

How it works is a bit different from the other gadgets. I install a BPF program on the sys_enter tracepoint, so every time a syscall is executed, for any process on the system, it executes this BPF program, which then checks whether the current cgroup of that process needs to be traced or not. For the pods that are traced, it redirects the execution to a different BPF program, and each pod has its own ring buffer where the syscalls are stored. So the syscalls are permanently stored in that ring buffer, in memory, but they are only sent to user space when the user requests it; normally nothing is saved continuously, because that would be too slow, only on demand.
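To make that dispatch a bit more concrete, here is a rough BCC-style sketch of the idea as just described. It is not the actual traceloop source: the map names and sizes are invented, the "redirect to a different BPF program" is shown as a BPF tail call, and a single per-pod program and perf buffer stand in for what would really be one per traced pod and per CPU.

```c
// Rough sketch of the dispatch idea (not the real traceloop code). The entry
// program would be attached to the raw_syscalls:sys_enter tracepoint from
// user space, and one per-pod program plus one set of perf buffers would be
// set up for each traced cgroup.
#include <uapi/linux/ptrace.h>

// Context layout of raw_syscalls:sys_enter
// (cf. /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format).
struct sys_enter_ctx {
    u64 common;                   // common tracepoint fields, not readable from BPF
    long id;                      // syscall number
    unsigned long args[6];        // the six syscall arguments
};

struct event {
    u64 timestamp;
    u64 cgroup_id;
    u64 syscall_id;
};

BPF_HASH(traced_cgroups, u64, u32);   // cgroup ID -> slot in the prog array
BPF_PROG_ARRAY(per_pod_progs, 128);   // one dedicated program per traced pod
BPF_PERF_OUTPUT(pod0_buffer);         // in reality: one ring buffer per pod and per CPU

// Entry point: runs for every syscall of every process on the node.
int probe_sys_enter(struct sys_enter_ctx *ctx)
{
    u64 cgid = bpf_get_current_cgroup_id();
    u32 *slot = traced_cgroups.lookup(&cgid);
    if (!slot)
        return 0;                     // not a traced pod: cheap early exit

    per_pod_progs.call(ctx, *slot);   // BPF tail call into that pod's own program
    return 0;                         // only reached if the tail call fails
}

// Example of the program installed in slot 0: record the syscall into this
// pod's ring buffer; nothing is read by user space until a dump is requested.
int pod0_sys_enter(struct sys_enter_ctx *ctx)
{
    struct event ev = {};
    ev.timestamp = bpf_ktime_get_ns();
    ev.cgroup_id = bpf_get_current_cgroup_id();
    ev.syscall_id = ctx->id;
    pod0_buffer.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
```

The important property is that the common path for untraced processes is just one map lookup, and the per-pod program only writes into memory; nothing crosses into user space until the trace is dumped.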
Okay, so I will show you a demo of traceloop, first as a standalone tool. What I will do here is run traceloop on a specific cgroup, the one for SSH. So now it is starting, it is recording everything that happens in that cgroup. I will put the process in the background, and then let me just start a new terminal and do SSH, so something should happen here. Okay, and actually I did not need to put it in the background. So now it is recording everything, but nothing is visible, because nothing is sent to user space, and when I stop it, here it dumps the last syscalls of SSH that were in the ring buffer. So I see the different calls with their parameters, like this.

And then I can show you how it works on Kubernetes, so not only on the command line but using kubectl. So if I do kubectl get pod, I can... sorry, I will start a new pod. This pod is doing some operations in a script: a multiplication, sending that to the bc program, saving the result to a file, and then attempting to print that file. So if I do that, what will happen? As you might guess, it will not work correctly. Here I do not know if I have internet working or not, but it fails because this file is different from this file, I made a mistake in my script. Ah, and I am not using the correct Kubernetes cluster, sorry.

So here there is an error. What I can do now, with kubectl gadget, is use traceloop to see what happened. So here I can list the existing traces. Even though my program finished and terminated, I still have a trace that is kept for a little while, and I can show this trace using traceloop, and now it lists the last syscalls of the program. So if I go... sorry, does it work? Cool. I think I have a problem in my demo, I did not trace the right thing, but you get the idea. So this is the kind of trace you can see: here you have a write system call, for some of the arguments I print the buffer, and some of it is not implemented yet.

How does it work behind the scenes? It uses Linux perf ring buffers; if you want to see the documentation, it is in the manual page for perf_event_open. In that ring buffer I can put different kinds of messages. They can be of different sizes, but because of the way it is implemented, each message needs to have a power-of-two size. So what does traceloop do? It adds a message in the ring buffer every time there is a new syscall, on syscall enter; for every parameter that is traced, we add an additional message; and then, when the system call returns, another message. All of these messages are put back together to reconstruct the system call in user space, on demand. Now, all the system calls on Linux are different, with different parameters, so traceloop parses the syscall descriptions from tracefs, which describe in text format what the different parameters are, and it parses that at start-up to know the different formats. This is a method inspired by traceleft, another eBPF syscall tracing framework that we worked on before. After having parsed these syscall formats, they are sent to a BPF map in memory, so that the BPF program can read them and know, for each system call, what to do. From that, it knows that the system call close does not have any string to dereference, the system call open has a path name, the system call write has a buffer with a size specified in a different argument, and so on.
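Just to illustrate that message flow, here is a hypothetical sketch, in C, of what such ring-buffer records could look like. The names, field layout and sizes are invented for the example; they are not the actual traceloop wire format.

```c
// Hypothetical record layout for a traceloop-like flight recorder. Each
// perf-ring-buffer entry is one of three message kinds; user space groups
// them by thread and timestamp to rebuild complete syscalls on demand.
// All names here are illustrative, not the actual traceloop format.
#include <stdint.h>

enum msg_type {
    MSG_SYSCALL_ENTER = 1,   // pushed from the sys_enter tracepoint
    MSG_SYSCALL_PARAM = 2,   // one per dereferenced parameter (path, buffer, ...)
    MSG_SYSCALL_EXIT  = 3,   // pushed from the sys_exit tracepoint
};

struct msg_enter {
    uint64_t timestamp;
    uint64_t pid_tgid;
    uint64_t syscall_nr;     // e.g. the number of write()
    uint64_t args[6];        // raw register values of the six arguments
};

struct msg_param {
    uint64_t timestamp;
    uint64_t pid_tgid;
    uint32_t param_index;    // which of the six arguments this payload belongs to
    uint32_t payload_len;    // valid bytes in payload[]
    char payload[128];       // copied with bpf_probe_read, truncated if longer
};

struct msg_exit {
    uint64_t timestamp;
    uint64_t pid_tgid;
    int64_t  retval;         // syscall return value
};

// Each record would be padded to a power-of-two size before being submitted,
// as required by the way the ring buffer reader is implemented.
struct record {
    uint32_t type;           // one of enum msg_type
    union {
        struct msg_enter enter;
        struct msg_param param;
        struct msg_exit  exit;
    } u;
};
```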
Lastly, I would like to show how traceloop can be integrated with systemd. That is more a wish list than something that is well implemented, but I will show you the idea. I would like systemd-run to have an ExecStartPre-like option, so that I could say here: start recording whatever happens in this systemd unit, and then run the program. This option does not exist yet, but I think it would not be too difficult to add. What it would do in the pre command is inform traceloop which cgroup should be monitored. Since that does not exist yet, I try to do it with a systemd unit, and that is what I have here; I will show you on a different screen. Here I have a systemd unit which runs some program, just a script, and I have a pre command which informs traceloop, over its HTTP API, to add a specific cgroup, and then a post command to ask it to dump the last syscalls in the ring buffer. So let's try the demo. Yes, so here it is: I started traceloop as a daemon, and then I try to start this unit. And here I get all the output from that systemd unit. Because of the post command, I am expecting to see the last system calls at the bottom. So here, filtering a bit, I see the bc program calling the write system call and so on. Okay, so that is the same thing here.

I would like to integrate that with systemctl status, so we could have this kind of information directly in the systemd tools. And that is about it for my talk. There is a lot of work pending on GitHub that was necessary for this, in BPF, in BCC and in Kubernetes. That is it, thank you. Are there any questions?

Well, as you can see from your wish list for systemd-run, the trouble is that you have to start traceloop in advance for a process that crashes early, right? The same problem still exists on Kubernetes: if you have a pod which crashes during startup, you cannot really capture things, because you do not have a pod yet that you could manually specify.

Yes. It is actually possible to do that, because the normal way to deploy things on Kubernetes is that you create a Deployment, and that will create pods, but you do not know the names of the pods in advance. So you cannot just say kubectl gadget trace this-pod-name, because you do not know the name of the pod yet if it crashes early. But you can filter things by label. If you have a Deployment, you know the labels in advance. So you can run kubectl gadget execsnoop, if you want to trace executions, or opensnoop, or any of the gadgets, and use the label filter.

And that works even if the label does not exist yet?

Yes. How it works at the moment is with string comparison in BPF, which is not ideal, but it is a proof of concept. What I want to change it to is an ID-based system, and that will still work with this use case where you want to trace from the very beginning of the pod. It works because I have the OCI hook: right before the pod is started, runc will execute that script and get the information about the cgroups and the Kubernetes labels.

Okay, thanks.

Do you already have some overhead numbers? Can I just run traceloop on a production process that keeps crashing randomly, or is it still too heavy, too much overhead?

I have not done any benchmarks on the overhead. I think one of my colleagues did, but I do not know the numbers. I would not expect it to be much, because when you do not ask for the information, traceloop just writes into a ring buffer, which is normally very fast; it does not actually do any context switch or anything like that. But I do not have numbers, sorry. Otherwise it is not really ready for production because of other problems; it needs to be tested and debugged more.

You mentioned that the ring buffer captures events. How big is that ring buffer? Is it configurable, and is it possible to have something else as a backing store, so that you can make your ring buffer big enough to capture a very long series of events?

At the moment it is not configurable. I think I set it to 8 megabytes, but I just chose that arbitrarily. And it is a different ring buffer for each CPU, because if you have several writers to the same ring buffer it will not work correctly, so each CPU has its own ring buffer, and each application as well: different cgroups have different ring buffers, a different set of ring buffers for each CPU. So I am not quite sure which size it should be, but each ring buffer is only something in memory, it is not backed on disk by itself; one of my colleagues did something where the information is regularly fetched from the in-memory ring buffer and kept. But my idea was just to have something to get the last events, not necessarily everything.
I think that would not be scalable: tracing all the system calls for everything, all the time, and dumping them somewhere.

I want to ask, how do you garbage-collect the buffers? Because they outlive the cgroup that they monitor, if I understand correctly. And then the second question: how reliable is the pointer dereferencing? Is it always reliable? Because I can imagine that your BPF program may be scheduled after the application that you are monitoring, and then the application may already have changed the contents of the memory that the syscall, I mean the kernel, saw when you entered the syscall, right?

Right. So for the first question: so far, each ring buffer has a reference from the traceloop daemon. So if you kill the daemon, then the reference disappears and the memory is freed. But so far I do not have anything to release it after a little while. My idea was to do it after some minutes or hours, I do not really know, after the pod has crashed and been removed. But as a proof of concept, I just keep them forever, until the user specifically asks to close this one. For the second question, about what happens if we try to start the tracing system after the pod has already been started: in this case you are right, it will not trace it, because I start monitoring a specific cgroup in the OCI hook, and if that is not run, if you have not installed it yet, then it will not catch anything. So yes, that is a limitation, but the idea would be to have traceloop always installed on the system, running as a DaemonSet on Kubernetes, so that it is there all the time.

Other questions? Then thank you. I will be around during the conference if you want to talk to me about BPF or traceloop or anything else.