Today it's me, Suchakra, and this is Alban, and we're going to present TraceLeft. It's a configuration-driven eBPF tracing framework built jointly by ShiftLeft and Kinvolk. I'm from ShiftLeft, and this is Alban from Kinvolk; we did this together, and we'll see what it's all about.

So, I'm Suchakra, a staff scientist at ShiftLeft. This is some information about me; you can follow me at @tuxology if you want. I did my PhD at École Polytechnique de Montréal. I love tracing, and I love performance analysis.

Hello, I'm Alban. I'll just say I love Kubernetes and low-level Linux development; I'm CTO at Kinvolk. A couple of words about Kinvolk: since you are in our devroom, maybe almost all of you know Kinvolk already, but we are a software development team working on Linux and Kubernetes, and we love this kind of thing.

And something about ShiftLeft: we are a continuous-security company for cloud-native applications. We provide static analysis and carry it forward all the way to runtime, so we basically protect your applications.

So what's the agenda for today? We are going to talk about TraceLeft, starting with some background about tracing, just to give you an idea of what we are dealing with here. Then the architecture of TraceLeft, and the trace configuration: this is configuration-driven, so we'll see how configurations are written and how events are derived from them. Because it's based on BPF, there's some background on what eBPF is (we already have Alexei somewhere here, so I'm already scared about this). Then the use cases where we are using it, and most importantly, some challenges that we faced, which Alban is going to discuss, along with the future work that could be done. I'll present the first half, and Alban takes over eventually.

To give you a little background about tracing: how many of you have used tracing or any kind of performance analysis framework in real life? Okay, a lot of people, that's super awesome. You may be using it in different ways, and one of the most common is to use tracing throughout your stack. I've tried to show the differences with this small diagram. You must have heard about OpenTracing, Jaeger, all these new frameworks; these fall under the distributed tracing category. It's really a gradient rather than a set of distinct categories, but you can see some differences. With distributed tracing, you get information about what flows from one microservice to another. You may also get information about the individual functions inside each microservice and how they communicate, which falls into the category of application tracing. As you go further down, you can even see what was happening inside the application from the infrastructure level: what was going on inside the OS while a given function was executing in user space. So this is roughly what tracing is. It's very different from other ways of doing performance analysis in the sense that it gives you the exact, true flow of an application, and that's why we call it tracing.
Because it runs on very high-frequency events generated by the operating system, such as syscalls, interrupts, and scheduling events, it needs to be super high-performance as well. The basis of tracing is instrumentation; I'll explain that a little later. And it's used for performance analysis as well as security: on the same basis of instrumenting specific functions, you can do either performance analysis or security.

A very simple example I keep giving people of what tracing is: think of your program as a bike being ridden, and you spray some paint on the tires; that's you instrumenting your application. Each individual point where you sprayed is basically a tracepoint. As the bike starts running, you generate events, and you get an actual trace on the road. From that you can work out where and when things happened; these traces give you the exact time at which each event occurred. You can picture it visually like that.

Tracing can be static or dynamic. A lot of static tracing infrastructure is already in the kernel, and if you're writing your own user-space applications, you can instrument them yourself. For example, kernel tracepoints are supported by perf, ftrace, and eBPF now. If you're writing your own applications, you can embed compile-time instrumentation; I don't know how many of you have used GCC options like -pg or the -finstrument-functions cyg_profile hooks, but you can use those. There are other frameworks, like LTTng, that provide this, and there's USDT: by default you already have a lot of tracepoints in the JVM, in the Python interpreter, and in the Ruby interpreter.

Dynamic tracing is even more awesome than static tracing, I would say, because your application keeps running and you can just insert a probe point anywhere on the fly and start observing what comes out of your application. And by application I also mean the kernel: the kernel provides dynamic tracing infrastructure in the form of kprobes and kretprobes. In user space, you can build your own instrumentation using dynamic binary instrumentation tools such as Pin and Dyninst, and uprobes are there too: with the kernel's help you can dynamically instrument a user-space application and get information out of each function's execution. There used to be DTrace; I think it still exists on BSD and macOS, but I have not used it very thoroughly. It closely resembles what eBPF provides for tracing these days.

To move on quickly: code instrumentation. You want to know about some function, so you insert a call into it, say call_me_maybe. When the function gets executed, call_me_maybe gets executed too and collects some data; you fill it with whatever you want. That can be timestamps if you're looking at performance; if you don't take timestamps and only look at the individual events, you're in the domain of auditing and security.

In the kernel, kprobe-based instrumentation is an example of this. You have a kernel function that gets patched: the first instruction is replaced, and execution is redirected to another area called a trampoline. There you save the registers and call the pre-handler, and multiple collectors can run at that pre-handler; one of them is eBPF, which is what we use in our whole infrastructure. Then the registers are restored, the original displaced instruction gets executed, and you jump back to normal execution.
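To make that flow concrete, here is a minimal sketch of the same mechanism using the kernel's own kprobe module API. This is not TraceLeft code, and do_sys_open is just an assumed example symbol:

    /* Minimal kprobe module sketch: the pre-handler below is what runs
     * from the trampoline before the displaced instruction executes. */
    #include <linux/module.h>
    #include <linux/kprobes.h>

    static int handler_pre(struct kprobe *p, struct pt_regs *regs)
    {
        pr_info("do_sys_open hit on CPU %d\n", smp_processor_id());
        return 0;
    }

    static struct kprobe kp = {
        .symbol_name = "do_sys_open",
        .pre_handler = handler_pre,
    };

    static int __init kp_init(void)
    {
        /* Patches the first instruction of do_sys_open. */
        return register_kprobe(&kp);
    }

    static void __exit kp_exit(void)
    {
        /* Restores the original instruction. */
        unregister_kprobe(&kp);
    }

    module_init(kp_init);
    module_exit(kp_exit);
    MODULE_LICENSE("GPL");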
The actual mechanism is more complex than that, but I've simplified it for the explanation.

So what is eBPF? Yesterday we got a very good definition; I'll give you one more from my perspective: it's stateful, programmable, in-kernel decision making for networking, tracing, and security. That's how the user-space folks understand BPF. Maybe the kernel people have different opinions about this, but I want this interface from the kernel to user space to be so seamless that it becomes the one ring to rule them all for networking, tracing, as well as security.

Just a small intro, which I'll go through quickly because previous talks covered it. Classic BPF has been around since 1993 and was used for network packet filtering. Some time later, seccomp BPF programs were added, so you could do syscall filtering with it. It was a small in-kernel VM, very small, with a very easy-to-use bytecode. It was then extended into eBPF, with more registers, more complexity, and a better verifier. You can attach to tracepoints, kprobes, uprobes, USDT, whatnot. I'm more interested in tracing, so I keep focusing on that, but you can also use it for network packet filtering and many other use cases that have already been discussed. There's a new syscall, so you control it via the bpf() syscall. You collect traces with BPF maps, or by sending the data directly to the trace pipe, which already exists in the kernel; this facility has been upstream since kernel 3.18. Bytecode compilation is also upstream in LLVM: if you're using Clang/LLVM, you have a BPF target by default that can generate the BPF bytecode.

So the flow looks something like this. There's a BPF program; you compile it with Clang/LLVM, and then you insert it into the kernel using the bpf() syscall. It gets verified, and then native code is generated for the architecture you're running on. You can design the program so that it hooks onto kernel functions, and data can be shared between user space and the kernel using BPF maps. With kprobes it's exactly the same thing, just attached to a kprobe, and you use BPF maps to read, update, and share what you collect. The events that come out of each program execution can be sent either to the trace pipe or to a perf buffer, and then you build your infrastructure on top of that. We use this as the base for TraceLeft.

An easier example of what a BPF program looks like: it's written in a restricted C syntax. Every time your kernel function is hit, this program gets executed. You can see some helper functions here, like bpf_get_smp_processor_id(), which tells you which CPU you're running on right now, and helpers that give you the PID of the current process. And then, because you have the context, which is all the register values at the moment the kernel function was hit, there are simple helpers to extract the arguments from those registers, following the calling convention of whatever architecture you're on. From those values you can build your own event; it's stored as an event structure, and you can output it to a perf buffer. The events are defined alongside maps (maps are the way you share data between user space and the kernel), the structure for each specific event is also stored in maps, and then you output the events.
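Pulling those pieces together, here is a hedged sketch of what such a handler looks like in the ELF style we compile with Clang. This is illustrative, not the generated TraceLeft code; the map name, the section names, and the SyS_open symbol are assumptions:

    /* Sketch of a restricted-C BPF kprobe handler (assumes the era's
     * bpf_helpers.h for SEC(), struct bpf_map_def and the helper stubs). */
    #include <linux/types.h>
    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    struct open_event_t {
        __u64 timestamp;
        __u32 cpu;
        __u32 tgid;
        char filename[256];
    };

    /* Perf-event map: the kernel-to-user-space channel for events. */
    struct bpf_map_def SEC("maps/open_events") open_events = {
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(__u32),
        .max_entries = 1024,
    };

    SEC("kprobe/SyS_open")
    int kprobe__sys_open(struct pt_regs *ctx)
    {
        struct open_event_t evt = {};

        evt.timestamp = bpf_ktime_get_ns();
        evt.cpu = bpf_get_smp_processor_id();
        /* Upper 32 bits are the TGID (the "process" PID). */
        evt.tgid = bpf_get_current_pid_tgid() >> 32;

        /* First argument per the calling convention: the file name. */
        bpf_probe_read(&evt.filename, sizeof(evt.filename),
                       (void *)PT_REGS_PARM1(ctx));

        /* Push the event out through the perf buffer. */
        bpf_perf_event_output(ctx, &open_events, BPF_F_CURRENT_CPU,
                              &evt, sizeof(evt));
        return 0;
    }

    char _license[] SEC("license") = "GPL";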
Which brings us to TraceLeft. It's open source; you can go to the repository and look at what TraceLeft is and how it's designed. It's a framework for building syscall, network, and file auditing and monitoring tools. It's a work in progress, I should tell you that beforehand: there's a lot that can still be done here, and you are welcome to contribute. It's eBPF- and kprobe-based, and it has been tested to work on kernels from 4.4 up to 4.16; 4.18 is not working right now. I tried it, but we have some patches we're working on, so probably tomorrow it'll be fixed. It also ships a binary called traceleft, which is a reference implementation of the framework itself.

The main goal is that there is a single binary plus a battery of what you want to trace: which syscalls, which events. You take that single binary and that battery, put them on any system they were built for, and it starts generating events. There is no need for BCC or for any library to be present on the system where you're running it; it's all pre-generated. It's very targeted tracing, as I like to call it. That was our use case internally at ShiftLeft: we built the battery of what we wanted to trace for one specific machine, built the binary, put it there, and it starts generating events that we just save. Everything is compiled from a configuration you provide, both the small binary itself and the battery of events you want to capture. So why? Because it's tracing that just works, obviously.

This is a high-level view of the architecture. You want to trace an application's flow through certain calls. There is a main BPF program that puts kprobes on all the functions you want to monitor, for example just these calls here, and the data is sent to the program maps. I'll go into the details of each of these pieces later. TraceLeft controls all of them, and there are specific event handlers for each. So there is one program that's always there, and it calls individual BPF programs for each event we want.

Here it is in a bit more depth. There's a specific map, there are kprobes and kretprobes, and for each individual event, the specific event-handler BPF program is invoked via tail calls. So there's one base, main BPF program, and it makes tail calls into small individual BPF programs, each of which generates a single event for a single probe you've placed, either on a syscall or on some other network event.
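As a sketch of that dispatch pattern (with illustrative names, not TraceLeft's actual program), the base program simply tail-calls into a program array that user space has populated with the per-event handler programs:

    /* Same headers as the earlier sketch (linux/bpf.h, bpf_helpers.h). */
    struct bpf_map_def SEC("maps/handler_progs") handler_progs = {
        .type = BPF_MAP_TYPE_PROG_ARRAY,
        .key_size = sizeof(__u32),
        .value_size = sizeof(__u32),
        .max_entries = 512,
    };

    SEC("kprobe/SyS_open")
    int base_dispatcher(struct pt_regs *ctx)
    {
        __u32 idx = 42;  /* hypothetical index of this probe's handler */

        /* Jumps into the handler program; does not return on success. */
        bpf_tail_call(ctx, &handler_progs, idx);

        /* Reached only if no handler is loaded at that index. */
        return 0;
    }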
All these handler programs put their events into a perf map, and TraceLeft keeps reading from that perf map.

There are multiple components to this; it's not an ideal setup. There is a metagenerator, which generates Go structures and C structures, because you're generating these events and they have to be stored somewhere. It looks at the kernel's /sys/kernel/debug/tracing/events/syscalls for the syscalls battery of probes and generates those structures directly. Then we have a generator, which generates the individual handler programs (those on the right-hand side of the diagram), and a battery, which is the compiled version of all of that. You can go to the source code on GitHub and see what each of these components does. Then there is a probe component, which is responsible for registering the probes; a tracer, which starts polling the individual perf maps and gives you the data; and we also have a reference implementation of an aggregator, which exposes an aggregator API so you can take all the events and aggregate them.

A configuration looks something like this. You want to trace an open event; it has all these arguments, first position, second position, third position, which obviously looks like open(). It's a per-event configuration: what do you want to collect, which variables at each position. And it's done only once: you write it once and update it very rarely for each event inside the kernel; you don't have to keep updating it. Then there is the aggregator configuration: you have channels, so you can save the data to a log or send it over gRPC; you have events, like open in our example; you can set rules (this part is still not completely finished); and you specify how you want to aggregate, that is, which functions to apply while aggregating. So those are the two configurations you provide, and based on them the events are generated.

This is the whole build process. As I told you, the metagenerator stage runs first and generates the individual structures, then the source for each individual handler, and then the BPF programs are compiled using Clang; maybe we can change that later to use the LLVM API directly instead of invoking Clang. And in the end it generates your own binary, your own implementation.

A CLI, for example, looks something like this. This is the reference implementation, which just traces everything based on the battery. I think Alban can explain a little more about this.

Hello. So I will do a demo; let's see if it works. I prepared two very short demos, and since I don't remember everything I'm doing, I took some notes. I have two shells, the one on the right with this PID. I start the traceleft binary and tell it to trace, loading the specific BPF handlers for the read and write system calls and applying them only to this specific PID. Let's see what happens when I run that: it should trace the terminal on the right, and indeed, if I type something, I can see all the read and write system calls. That's the first demo. Note that while doing that, I only traced this one PID: I specified on the command line not to trace all the system calls of the whole system, but only this one.

Let me start the second demo. For this demo, I prepared a script.
It's a very simple script that sets up a TCP connection: after a couple of seconds, it starts a TCP server and a TCP client and connects them to each other. And I will ask traceleft to trace the BusyBox shell running the script. So let me start the script, and then I start the tracer. After a few seconds, I should see the TCP connect event with specific information attached to it, like the connection tuple: source, destination, et cetera. And if I stop this, I should see the TCP close event. As you can see here, I see a connect and a close, but I don't see the TCP accept event. That's because I only traced one specific process, the shell script here, so it didn't trace the other one.

I have another demo, with the help of Suchakra. Here I'm logged into a web server. I started traceleft, and instead of specifying read or write, I just passed the whole networking battery of BPF programs that we have; it does the same thing. I just opened my own website, and we can see where the connections come from, basically tracing the network calls going on on this server. It's just a simple server running nginx and my own website. That's it.

There was one more elaborate demo that we did internally based on TraceLeft, where we had our own monitoring agent for syscalls, built on the aggregator API provided by TraceLeft itself. It looks something like this. I don't have that demo right now, but you can at least appreciate that you can build something as complex as a nice ncurses UI for syscall monitoring on top of TraceLeft. I think Alban continues from here, and he'll discuss something very important that we learned. This is more important than TraceLeft itself, because it covers the challenges we faced, how we overcame some of them, and what else can be done later on.

Yes. A lot of the challenges we faced came from wanting to support kernel 4.4; some of these issues have been fixed in later kernel versions, but I'll explain the context. The first challenge I'll explain is matching PIDs to applications. The goal of TraceLeft is to have a kind of tracing profile for a specific application, and one application can be one or several processes; sometimes very short-lived processes, sometimes a process that spawns lots of others. An application can be a systemd unit running inside a cgroup created by systemd, or it could be a container, in which case it might live in different Linux namespaces and different cgroups.

When we wanted to implement that, we looked at the different BPF helper functions to see what existed; I'll mention just the ones dealing with PIDs, cgroups, or namespaces. The first one, bpf_get_current_pid_tgid(), gives the process ID and thread ID and has existed since kernel 4.2, which was perfect, because our restriction was that it has to work on kernel 4.4. There is bpf_get_cgroup_classid(), which is not related to tracing, and some others, which I put in red because they didn't fit our criterion of working on kernel 4.4. This list comes from a GitHub page that's very useful as documentation: it lists all the BPF helper functions and from which kernel version you can use them. So on kernel 4.4, basically the only thing I could use was to check, from within the BPF program, the PID of the process being traced at that moment. As a consequence, the TraceLeft API is based on that: there is a function in the API where you give the PID you want to trace, and the BPF program checks against it.
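A minimal sketch of that kernel-side check, assuming user space fills a hash map with the TGIDs it wants traced (the names are illustrative):

    /* Same headers as the earlier sketches. */
    struct bpf_map_def SEC("maps/traced_tgids") traced_tgids = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(__u32),
        .value_size = sizeof(__u32),
        .max_entries = 1024,
    };

    SEC("kprobe/SyS_read")
    int kprobe__sys_read(struct pt_regs *ctx)
    {
        /* Upper 32 bits: TGID; lower 32 bits: thread ID. */
        __u32 tgid = bpf_get_current_pid_tgid() >> 32;

        /* Not a process we were asked to trace: bail out early. */
        if (bpf_map_lookup_elem(&traced_tgids, &tgid) == NULL)
            return 0;

        /* ... collect and emit the event as shown earlier ... */
        return 0;
    }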
Of course, when we build the whole framework using TraceLeft, we want to match the application, not just one single PID, so we needed to use something in addition. At the time, we implemented that using a Linux facility called the proc connector. The proc connector is part of the netlink socket family. How many of you know what netlink is? About everybody, cool. So using netlink and the connector, you get a publish-subscribe mechanism where you can receive events: in this case, notifications whenever there is a fork, exec, or exit of a process. So we can know whenever there is a new process, and maybe that's a process belonging to the application we want to trace. It's quite an old facility, so it works fine on Linux 4.4.

The proc connector has some strong limitations, so it's not really perfect, but it works okay. It doesn't really work from inside a container: it has to run in the initial user namespace, the initial PID namespace, et cetera. And it requires network privileges, which is a bit weird when you're tracing processes. Also, it doesn't give all the information we need; it has no cgroup or namespace information, which means that whenever we use it, we additionally have to read /proc to get the extra information we want. But reading from two different sources of information brings race conditions that are difficult to solve and sometimes not directly solvable. For example, with short-lived processes, which happen quite often with shell scripts, the tracing process might not have time to read the information it needs from /proc before the process is gone. So I would recommend not using the proc connector for this; our choice came from the requirement to work on kernel 4.4. Now there are new BPF helper functions that are more suitable, for example bpf_get_current_cgroup_id(), which tells you the cgroup you're running in and exists since kernel 4.18. In general, I would recommend using the new facilities, or, if they don't exist, trying to improve the kernel.

Another difficulty we had was related to strings in BPF. Take this example: in your user-space program there is an open system call where you pass the file name as a parameter. Since we added a kprobe with BPF on the open system call, at some point a BPF helper function will be executed that reads the file name; it copies the string buffer. And then, when the system call is actually executed and the kernel implements open, the kernel copies the very same buffer again. So there are two copies here, and that brings problems: it's a time-of-check-to-time-of-use issue. If the program is multi-threaded and changes the value of the buffer in user space between the two copies, we might not see what's really going on.

There are other issues with strings. Since we were running on kernel 4.4, we didn't have the nice helper function that can copy strings, so what we did instead was arbitrarily copy 256 bytes, which quite often is enough for a file name, but sometimes not.
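Here is a hedged sketch of both variants side by side (not TraceLeft's code; the probe and buffer names are illustrative):

    /* Same headers as the earlier sketches. */
    SEC("kprobe/SyS_open")
    int trace_open_filename(struct pt_regs *ctx)
    {
        const char *filename = (const char *)PT_REGS_PARM1(ctx);
        char buf[256];

        /* Kernel 4.4 era: no string helper, so blindly copy a fixed
         * 256 bytes and hope the path, and the mapping behind it,
         * is the right size. */
        bpf_probe_read(&buf, sizeof(buf), (void *)filename);

        /* Kernel 4.11+: bpf_probe_read_str() stops at the NUL
         * terminator and returns the number of bytes copied,
         * avoiding the blind over-read. */
        bpf_probe_read_str(&buf, sizeof(buf), (void *)filename);

        return 0;
    }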
And we had yet another issue. Say this is the virtual memory of one process: you have some regions that are mapped in memory, and some addresses that don't map to any physical memory. If you're given a pointer quite close to the border of a mapped region, maybe you won't be able to read 256 bytes, because you'd go outside that region. That can cause a fault, which is fine in BPF, because all the BPF helper functions you use are correctly designed not to crash your kernel, but it still caused some surprises while developing this.

Another challenge, which is not really solved properly here, is identifying files. When something reads or writes on the file system, it uses the read and write system calls and passes a file descriptor, but it's not so easy to track which file descriptor matches which file name. To do that properly, we would need to track that this file descriptor belongs to this file, and so on. But, as in this example, if you use the dup system call, it gets a bit more complicated. And processes can actually get file descriptors from many different sources. There are of course the open and openat system calls, and all the dup system calls, but there are a lot more places a new file descriptor can come from; for example, you can receive one over a Unix socket, and that's not easily traceable. And even for the open system call, where we are given a string for the file name, it's not so easy to map that string to the actual file, with its mount and inode numbers and so on. That's because the path is going to be looked up taking into account the mount namespace you are in, the chroot (the root directory you use), or, if it's a relative path, the current working directory, and so on. And in the middle you can have a lot of symlinks, which makes things complicated. So we don't have a proper implementation of this at the moment; we have something that works in some cases, but not everything. All of these deficiencies come from the fact that we put the kprobe on the open system call, and that's maybe too high-level, where we only get the file name. Other projects, like the Landlock Linux security module, try to do this at a lower level in the kernel: they use the Linux security hooks, where they actually have access to the proper kernel objects they want to track, the inodes, mounts, et cetera. So that would be something to explore, to do something similar.

Tracking networking was a bit difficult as well. When we have a connect system call, we can see the destination IP in the system call, but we don't have the full connection tuple there. So what we did is add a few more kprobes on specific functions in the kernel to get the information we need. The source is really similar to code that comes from another project, Weave Scope, where we did similar work.

Another difficult part: sometimes we lost events, and I'll explain two different reasons why. First, BPF programs run synchronously; a BPF program cannot sleep or wait when it emits events to user space. We use a perf ring buffer, and a ring buffer can be overwritten if it is full: we just write over it, we don't wait, we don't sleep. We chose a specific size for the ring buffer, and if it's too small, it's possible that we just lose some events.

The other reason we can lose events is kretprobes. Suchakra explained before how kprobes work: we put a jump instruction at the beginning of the function and go off to another routine. A kretprobe is a bit similar, but a bit different too. The difference comes from the fact that we don't know in advance where the return will go; that depends on where the function is called from. So what a kretprobe does is save the return address before the function call. But a function can be called several times in parallel: if you have multiple CPUs, if you have preemptible kernels, it means you need to save several return addresses, and you don't have infinite memory. So by default, a kretprobe is only able to track so many concurrent calls, and there is a default value for that, which comes from a formula in the kernel.
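Roughly, from register_kretprobe() in kernel/kprobes.c in the 4.x kernels, lightly abridged, the default is:

    /* Default number of concurrently tracked calls for a kretprobe. */
    if (rp->maxactive <= 0) {
    #ifdef CONFIG_PREEMPT
        rp->maxactive = max_t(unsigned int, 10, 2 * num_possible_cpus());
    #else
        rp->maxactive = num_possible_cpus();
    #endif
    }

So without preemption you get one slot per possible CPU; with preemption, twice that, and at least ten.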
The accept system call is where we had the most problems with this, because accept can take a long time: if you don't have any incoming connections, it can sleep for hours. And if you have several processes sitting in the accept system call, several kretprobe instances stay active in parallel. So we worked with others on the kernel to make this value configurable, but that still doesn't completely solve the issue.

Okay. And lastly, what we could do in the future: maybe use tracepoints, since they were not really an option on kernel 4.4, but they would offer more stable APIs than kprobes, which can change at any point between kernel versions. We now have new BPF helper functions that help do things more properly as well. And we could use the LLVM API directly instead of forking a shell to start Clang. So I will hand back to Suchakra.

Just some references here; there are projects that have already done related work. If you have seen BCC already, you know this one. There is bpfd, which is very recent and looks somewhat like what we have in TraceLeft, with the addition that it also has a daemon and you can do more actions with it; there was also an older project called bpfd, which also resembles what we do. Then bpftrace, which is very promising (it's by Alastair, I think), and the same with ply: bpftrace and ply are languages that look like DTrace, and they directly generate BPF code that can be executed. Then the Landlock LSM, which is promising and probably upcoming in the kernel. And auditd, which clearly resembles what we are trying to do with syscall monitoring here: the audit subsystem is already there inside the kernel, with kauditd as a separate component, so you can leverage that as well. There are some docs and tutorials about BPF that you can look at later if you want to read more, and there has been research work done on this, obviously, which you can also read about later.

So that's all. You can ask us some questions. I would specifically like to thank Kinvolk for working with us on this, and Iago, and Michael, and everybody else. Thank you.

If you have questions, you can just ask. Thanks. Do we have some numbers on the overhead of TraceLeft? Some rough idea?

Yes. TraceLeft actually has (thanks to, I think, you, who did that) a pprof endpoint in TraceLeft itself, so you can profile TraceLeft as it's running and check the overhead. I don't remember the exact numbers, but I have them somewhere, I think in the repo itself; I will check. In the repo there is a documentation directory, and in it there is a section called "performance and profiling" that tells you how to use this.
I would say there are two possible sources of overhead. One is the BPF programs; I think that part is quite low. I don't have numbers, but other projects have published numbers on that. The other is the TraceLeft binary running in user space; I think that's where most of the overhead takes place, and that's where pprof will help.