Hi everybody. I'm Stephen Hemminger. I work for Microsoft, but I'm also very much involved with DPDK and the Linux kernel. I am not one of the eBPF developers, even though there's some eBPF in this talk; it's not the main focus. I just can't see the button. There we go. I'll skip the table of contents to save time.

So when I started this out, I was looking around, and I thought: in America there's a proverb that says only a bad workman complains about his tools. And since I was coming to Belgium, I looked it up; it turns out it's an old French proverb, that a bad worker never finds a good tool. But then I kept looking, and there's actually a Chinese proverb that says: to do a good job, a craftsman must sharpen his tools. I do woodworking, and I've learned the hard way that unless you sharpen your chisel, you destroy things. And I think the same is true of the tool sets we work with.

But before I go too deep into tool sets, I want to throw up a big red flag: don't get too enamored with the tools. The tools are not the most important thing. The most important thing is figuring out how you're going to do the analysis with a good methodology. So start out with a problem statement: what are you trying to solve? What does your workload look like? What are your targets? Then use a good methodology to do this in a structured, scientific manner; don't just throw it at the wall and see what sticks. I'm not the person to go into detail about that, but you should really look up any of Brendan Gregg's talks on Linux performance analysis, and the eBPF textbook that is now out from O'Reilly. Do some research and learn how to do it before you get too deep into playing with tools.

Now that I've thrown that out, I'll go on to talk about tools. In my job we spend a lot of time working with DPDK applications, and with a DPDK application there are really two ways to look at it.
One way to do the analysis is to analyze the data: what's coming in and what's going out. This is traditional tcpdump-style packet filtering. You have a mechanism to get at the data, to see what the application is doing, why you're dropping packets, what's not going through. The other way to look at it is programmatically, and that's what tracing is. Tracing is all about inserting something into the program and seeing what you get. The poor man's version of that is printk; this is doing something smarter than printk to figure out what's going on. Both of these are useful, and both of them are part of the solution.

On the packet capture side, in the early days of DPDK there was traditionally nothing. The first user survey feedback we got was: I need tcpdump. So Intel and a few others developed a packet dump facility for DPDK called pdump. The way it works is that your application runs as the primary process, and you have a secondary process that shares a ring with the primary and captures packets. The packets are output in pcap format and you put them somewhere.

The problem is that this implementation is very limited. You have no metadata, so you don't even know that the VLAN tags are gone. You have nothing about offload flags. What flow rule matched? You have very little state in this packet capture. The implementation is not very robust, and the timestamp happens at the far end of the pipeline, kind of like what you heard earlier in the collectd talk, only at the DPDK level. The packet isn't being stamped when it arrives; it's being stamped when you finally pull it off the ring, which is pretty much useless. You have no direction information, so if you're capturing both receive and transmit, you don't know which way a packet was going. And the current implementation only really handles a single port. If I've got a router, I want both ends captured at once into one file.
Otherwise you have to juggle multiple files and try to correlate them by timestamps, which is pretty hard. And lastly, there's no filtering, so you basically get the fire hose. Because of all this, the performance is pretty poor.

So what I've been doing — it's not actually upstream yet, only patches that are still in progress — is adding pcapng support to DPDK. With that you get nanosecond time resolution. We can put system and interface data in the file. If you're in the cloud, it's really important to be able to know: I got this data file I captured on some node — what was the hardware on that node? What was the firmware version? All of that can be put in the capture file. It supports multiple interfaces, so you can have up to a 16-bit number of interfaces, put them all in one file, and keep them together. pcapng supports flags for direction, and you can put the hash value in with the packet, and even comment strings with any other data you want. So it really provides all of that. We're working out the details of the API and ABI, but it will definitely roll through.

The next thing I've been working on, related to this, is putting packet filtering into the packet capture. The traditional way this works with tcpdump is that you start with a nice text string, like "show me all packets with an IP destination of fosdem.org", and libpcap does all the magic and produces a classic BPF (cBPF) program. The problem is that we don't have a classic BPF interpreter in DPDK, and we really don't want to put one in. So we'll do the same thing the kernel does, which is translate that into an eBPF program. I've worked with several people to get permission to take the same basic code that does that translation for the socket filter in the kernel and make it BSD-licensed, so that we can put it in DPDK.
So at the outer level you type in the filter string, at the lower level you get the eBPF program, and then we can execute that in DPDK with the JIT and everything else that's there. How am I doing on time?

On the tracing side, there are several options on Linux. I have to admit that for user space, none of them is ideal, but I'm going to describe two or three that I tried, how they work, and what using them looks like. The first one is LTTng, the Linux trace toolkit. It's been around a while, it's very easy to use, and it's very well documented. The typical way to use it in user space is to define trace points, put those trace points in your code, and run filters at those trace points; it produces Common Trace Format data which can then be digested. From a user-space tracing point of view, it's very high performance. It uses RCU and ring buffers — all the same kinds of performance tricks that DPDK uses — to get a fire hose of trace points into user space. My example instruments testpmd, the DPDK application everybody seems to use most. Oh, great — I got messed up by some formatting; never use tabs in slides. But if you look at the core loop that sends a burst of packets in the l3fwd application, there's a trace point in there that says: hey, I got some packets.

You can do the same thing with the USDT probes inherited from DTrace, which now work in user space. I'm going fast here. So, the same thing in user space: this is testpmd receive. You put a DTrace probe point in and you get the same kind of thing. To use it, you run bpftrace and say: I'd like to attach to this user-defined trace point in testpmd, and you give it an expression for what you want. So I did an example running bpftrace against the testpmd rx_burst probe: give me a histogram of how many packets I get each time through the loop.
Do I get a little or a lot? That lets me decide whether I should use a bigger burst size, how much memory I should use, a whole lot of things.

Now, the downside. I ran some performance tests, and these are not rigorous performance tests — that's what these slides are trying to say. This is not an analytical, statistical, write-a-paper test; this is a quick hack, an unboxing: what am I getting? I have two systems, one sending to the other at full speed over a 25 Gbit NIC. What does it cost me? With packet dump turned off, just having the infrastructure in place costs nothing. But as soon as I start capturing, I get a 36% performance loss. Then I did a hack: what if I had eBPF in there, and my capture filter always evaluated to "don't capture this packet"? So it was just a null program that says no to everything. That went from a 36% loss to zero. So that says that capture filtering with eBPF is going to work for what you want. You still have the problem of why packet capture itself takes so long, but that's a different problem.

As for tracing with LTTng: putting the trace point in costs an insignificantly small amount of time, less than 1%. And when it's enabled, the cost was down in the noise. So basically, at 25 Gbit, that trace point was costing me nothing — though doing something with the data may cost something. With eBPF tracing, putting the trace point in costs even less, because it's just a bunch of no-op instructions. But turning it on and running that histogram program cost a 56% drop in throughput.

So, rushing here: we've got a bunch of ongoing work on DPDK packet capture to give it full filtering. I've got a new version of the dump program, because I want a user interface that looks more like tshark and less like a DPDK application — users don't care about all the wonderful hardware flags that we love in DPDK.
And there's ongoing discussion on the mailing list about how DPDK can have a standardized set of trace points so that other tools can hook in and use them. I wanted to leave some time for questions, so I sped up as soon as the flag went up. What? Oh. So I will now open the floor for questions for a minute or two. Yes.

[Question about the eBPF filters used for capturing in DPDK and the eBPF virtual machine.] We have an eBPF virtual machine in DPDK. It's fairly limited right now: there aren't a lot of hooks that use it, and it doesn't support maps and all the neat tricks you can do with eBPF. So it needs a lot more work, but it's there already. [Is it a completely separate code base?] It started out from a common code base, but it's maintained separately. And it was agreed — a lot of this is shared work with the eBPF developers on Linux, so the question is: are we willing to BSD-license this? That's often the biggest hurdle.

Other questions? Back there. Right. I think you need two things. With LTTng you can't do the at-scale thing of "run a program on the node and give me the results," and that's where BPF shines: when you're trying to scale up by using the compute resources at the trace point itself. Where LTTng works really well is when you're saying, for security or other reasons, I can't run things in that environment, or it's an embedded system — get me the data and I'll process it later. So my recommendation is to abstract it. The classic Debian thing: if you can't make a decision, make both decisions. Have an abstraction that says, here are the trace points; document that, and let people choose, for their environment, which tool they'll use to enable each trace point.

I see time's up. Yes, Thomas. [How can we make progress on sharing this?] That's a good question. I really wish we could. I haven't volunteered for too many things right now; I volunteered to help, and that's about all I can do. Thank you.