Okay, so welcome everyone, and thank you for being here. Welcome to the talk "eBPF tales, or: why did I recompile the kernel just to average some numbers?" As you can see from the eBPF logo, we're going to be dealing with some BPF, and as you can see from the subtitle, we're also going to talk a bit about Linux kernel internals. But before I explain why we're having this talk, I need to give you some context first.

I think we all know the big data trend, or, as I like to call it, the data deluge. I'm pretty sure most of you in the audience work in industries that deal with big data, because it is just so common and widespread. I'm only reporting some figures here; if you want to see more, there is a link at the bottom of the slide. We live in a big data era: by 2020, by now, we have around 44 zettabytes of stored data, and 53 percent of companies use it to perform big data analytics, because implementing big data analytics tailored to your business is estimated to increase revenue considerably. It is a very big industry, and by 2023 it is expected to reach a market value of up to 77 billion dollars. So clearly there is a lot of interest and a lot of attention on this topic. I'm sure all of you, or at least most of you, have used or know the main big data software and frameworks, like Hadoop MapReduce, Apache Spark, or, for machine learning, TensorFlow. They are so widespread that they have almost become common knowledge and a synonym for the IT industry as a whole.

But that is just the surface. It's a highly valuable market and a lot of companies are in the business, but what actually happens under the hood? That is what we're going to look at today. Clearly the data is too much to be stored on any commodity infrastructure, so to perform the data analytics we go to data centers. Data centers are a bit more complex than what this picture represents, but it gives a fairly good schematic summary of what I want to convey in this talk. What is important to understand is that the architecture of data centers is usually not aggregated but rather disaggregated: it is split between compute nodes, where the data crunching happens and all the computation and data analytics take place on bulky servers with lots of resources, and, connected through the network, storage nodes, which are the ones that store petabytes or terabytes of data and then ship it to the compute nodes according to the elasticity and scalability requirements of the moment.

As you can see, however, for the storage node I didn't use the good old hard drive icon, the cylinder. Instead I used flash, and that is very important, because that's where all the problems in this talk come from. Flash is a high-performance storage technology, and it introduces a new problem in data centers. The problem is born from the fact that, yes, networks are very high performance, there's no doubt about that.
There are usually 16-lane PCIe connections that can accommodate 16 gigabytes per second of transfer between nodes and switches. That looks like a lot, but if we look at what a normal storage node actually looks like, where we have for example 64 SSDs, we can see that the SSDs, with their very good performance, can deliver an aggregate throughput of 128 gigabytes per second, which is more than what the network can accommodate. So when we perform the transfer there is going to be a throughput gap: the storage simply produces data faster than the network can move it. And it gets even worse if we consider the internal latency and throughput of the SSDs, which right now is not even fully exploited: it could potentially reach up to one terabyte per second, giving a throughput gap of around 66 times what the network can accommodate. You can find more detail in the articles referenced on the slide.

This is the problem I want to address. It is known as the data movement wall, and it stems from the fact that moving data from storage to compute nodes across the network is a bottleneck in the current industry and data center architecture. We are at a point where data just keeps increasing, we gather more and more of it, and in order to make sense of it we need to move it to the compute nodes. At the same time, the storage is becoming too fast to move data out of efficiently, and data movement is becoming the bottleneck. As I said, this is commonly referred to as the data movement wall, and it is the problem we are going to talk about in this talk today and try to solve, or at least mitigate.

At this point I think it's time to give a hint about the solution, because it took some time to develop, almost a year of work for me. The idea we are going to pursue is that, in order to reduce the data movement, we want to move a bit of the computation from the compute nodes to the storage, and in this way optimize the data transfer. So instead of transferring terabytes of data, hopefully we will have to transfer quite a bit less, get an overall optimization, and mitigate this data movement wall. Doing this means using programmability, which is a modern trend happening all across the board. Networks already have a very strong programmability aspect, for example in switches; it's something that is really happening. Not so much in storage, and that's what we are going to try to do instead, in particular with eBPF.

But before we get to the real details of the talk, I should introduce myself. I am Giulia, I come from Italy, but the work I'm about to present was produced at the University of Amsterdam, where I am soon to be a PhD candidate in the @Large research group, which is also the group in which I did this work. In our research group we have a strong focus on data center performance and big data workload analysis, and we try to optimize all sorts of things, from scheduling decisions in data centers to network performance variability, to make it as stable as possible. My work, for example, will focus on the storage side.
So this is why I'm taking this angle on the talk and on the topic. Now that we know the problem, the data movement wall, and who is trying to solve it, that is, me, I think we can begin to talk about the details.

As I mentioned, these are the data centers, and when we're talking about big data analytics we might want to do something like compute the average of some numbers. What happens when we need to do this? Well, the compute node, or rather the user, makes a request: it asks the storage node for the numbers it needs to compute the average over. The request travels through the network, the storage node on its side produces the data, and that data is then transferred across the network, slowly but surely, until it eventually reaches the compute node, which starts computing and comes up with a result that looks something like "the average is actually 42".

This works, but as we saw, if we're talking about gigabytes or terabytes of data, moving it across the network is not that convenient, and it doesn't really make sense if we think about it. We performed a huge data transfer, which introduced latency, because the storage node was potentially ready with the data way earlier than the network could consume and transfer it. It introduced unnecessary potential for network congestion, and it bottlenecked the storage, all for something that is just a number: we computed an average, a single value, potentially a single digit, that came from gigabytes and gigabytes of data transfer.

The idea of this talk is to optimize this with programmability on storage. What we're going to do instead is move the computation to the storage, and it's going to look a bit like this: the compute node instructs the storage with whatever function it needs to compute at that moment, which is again going to be the average. The storage, on its side, somehow learns how to perform this operation, then retrieves the data, immediately starts computing, and comes up with the result, the average is again 42, and only that result is then transferred to the compute node.

We can see that this is a bit different from what we did before, and it makes more sense, because we reduced the data transfer: we only transferred the result, which, by the way, also saves some money if we are paying per byte of network transfer, since it is one byte instead of gigabytes and gigabytes of data. It reduced the network congestion, which is always a desirable thing in a data center, and we could actually leverage the high throughput of the storage, because it was not waiting for the network to finish the whole transfer before it could start the computation. Now, we saw that we want to move this to the storage, but how can we actually do it?
Well, we still want to keep all the data center requirements in mind; we are not giving up anything from the point of view of multi-tenancy. Multiple users need to be able to perform their own computation on the data, so of course all the users still need to be isolated from each other. Of course we want performance to be as high as possible, and both the users, the tenants, and the data center operators would rather have this implemented with as low a deployment cost as possible. In addition to this, the solution I'm looking for also aims at reducing the data movement, so we have that additional requirement to keep in mind. With this list of requirements we can go on, explore the space, and look for a solution.

What are the options for doing this? We all know that data centers are made of computers, so they will most likely have operating systems installed, which means users can run their own applications. For example, we could implement the average in any programming language we can think of: we could do this programmability aspect in user space and let the users define their own functions. Another option, which is for example the route the networking world usually takes, is the hardware path: optimizing small operations with specialized hardware like ASICs, FPGAs or GPUs, which is a very common trend nowadays. But there is a gap between what users could implement in user space and the hardware, which is more expensive to deploy and so goes against our requirements. In between there is the kernel space, and it turns out that data centers, as you may know, usually run Linux, and for a few years now Linux has been shipping with eBPF, which stands for extended Berkeley Packet Filter.

What is eBPF exactly? Some people refer to it as JavaScript for the kernel. It is basically a programming language, an instruction set: if you want to go hardcore you can write eBPF in an assembly-like language, but it also has a C interface, which makes it a bit more approachable. It allows every user to write their own code and then execute it in the kernel, and this happens without actually reinstalling the kernel. It's a bit like a kernel module: you can load it and unload it as needed. But it has an added benefit over kernel modules: eBPF is actually formally verified. It would be a very bad idea to allow users in a data center to run arbitrary code like that in kernel space; we know that, and we don't want to do it. eBPF provides a safety net for that, because the verification guarantees that the code is going to terminate, it is not going to crash the kernel, it is going to terminate in a fairly limited amount of time, because it cannot run indefinitely, there is a maximum instruction count, so it will actually terminate pretty quickly, and it is not going to corrupt kernel structures.
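To make that concrete, here is roughly what a trivial eBPF program written against the C interface looks like, a minimal sketch in the usual libbpf style. The attach point and map name are just for illustration; the point is that it is compiled once and loaded from user space, verified by the kernel, without touching the installed kernel.

```c
// Minimal eBPF sketch: count invocations of a kernel function.
// Illustrative only; built with clang/libbpf and loaded from user
// space without recompiling or rebooting the kernel.
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} counter SEC(".maps");

SEC("kprobe/vfs_read")            /* attach point chosen for illustration */
int count_reads(struct pt_regs *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&counter, &key);
    if (val)
        __sync_fetch_and_add(val, 1);   /* bounded, verifier-safe work */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```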
So the safety guarantees are pretty solid, and that's why eBPF is being used more and more in the kernel: it allows changing function behavior without having to update the kernel and reinstall from scratch, which is not something you want to do constantly in a data center. In particular, as I said, eBPF stands for Berkeley Packet Filter, extended, because it has gained functionality compared to what it could do at the beginning. The name Berkeley Packet Filter tells you that it was initially mostly for networking, and in fact on the network stack it can do some very powerful things, like in-flight packet inspection. It can even modify the packets, for example translating addresses from IPv4 to IPv6, or changing just the IP and implementing some sort of load balancing, or even dropping packets, if you want to implement some sort of denial-of-service prevention, before they even enter the kernel, because you can place eBPF code basically in the network interface card.

All of these look like very powerful tools, and the way I look at them is that they are essentially read, write and drop primitives on buffers. The buffers are packets in the networking case; in the I/O case they would likely be file data flowing through the kernel. So the idea I had is that this looks like something I could do on storage, and implement just what I had in mind: packet filters, namely, but on file data.

To do that, though, we need to understand that while eBPF was born for networking and is really well established on that side, the same is not true at the same level for the storage stack. For example, in the networking stack you can just attach to a socket and go from there; in the I/O stack it might be a bit different, and we don't really have those tools yet. So first of all I need to understand what the code looks like, to understand what I can actually do for this use case. Don't worry, it's not going to last too long, but we're going to go through a little bit of kernel code before proposing the actual solution.
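For reference, before we move to the I/O side, this is roughly what the packet-dropping primitive I just mentioned looks like: a minimal, purely illustrative XDP sketch that simply drops everything it sees at the network interface, before it enters the rest of the kernel stack.

```c
// Minimal XDP sketch: inspect packets at the NIC and drop them early.
// Illustrative only; a real filter would parse headers before deciding.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int drop_all(struct xdp_md *ctx)
{
    /* ctx->data .. ctx->data_end is the raw packet buffer. */
    return XDP_DROP;   /* or XDP_PASS / XDP_TX after rewriting fields */
}

char LICENSE[] SEC("license") = "GPL";
```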
As I said, it is first important to understand the difference between the networking and the I/O stack. In my use case, what I ideally want to intercept is the read path: when a user asks to read a file, somewhere in that code path I'm going to have to insert my BPF extension, which performs the computation I need. So first of all I needed to understand what the code looks like, and I collected the call trace in the kernel for a read system call. It's a bit complex, but I knew the requirements and how to navigate it in order to find the right attach point.

While analyzing it, I knew I had to look for a way to identify the user and the file being targeted, so that in a data center scenario we would be able to distinguish between multiple users. Of course it also had to be a function that allows me to inspect the data flying by in the buffers, and this can happen because BPF gives access to a function's arguments when the function is called. So I was looking for a function that gives me observability on the data being transferred; from that point on I can enter my BPF code and execute whatever the user told me to execute for that specific file.

What I found is that somewhere in that call trace, of which I showed only an excerpt earlier, there is a function called copy_out. It copies from somewhere to somewhere else for a certain size, and that looks like something I can actually use, because it has close to all the information I need. The calls above it were not really suitable, because they didn't have access to the raw data; there was some metadata in between, and we thought it was better to avoid giving the user the capability to access kernel metadata, which doesn't look like a good idea. So we had to exclude those, even though they had some benefits, for example reporting the file name, which would have been a good way to identify the target, by the way. Anything below copy_out becomes architecture specific, assembly code that differs between Intel, Arm and AMD, and we didn't want to go there, because then whatever we wrote would be specific to only one architecture, which is not nice to have as a requirement, or as a feature, not really.

So we stayed with copy_out, and we decided to identify the user with the tuple of process ID and buffer address, because we can communicate that from user space down to the copy_out function, and it works well enough assuming the address space is large enough, which it is nowadays. We figured it gives a pretty unique identifier for the user and the file being accessed. This is where we insert the kprobe, which is the program type that BPF can use to intercept the function, and then perform a computation, not a random one, but a user-defined one. Now that we know all of this, it doesn't look that difficult, so everything is going according to plan, right? In a way, yes.
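To give an idea of what this interception looks like in practice, here is a rough sketch of a kprobe program on copy_out, with the (process ID, buffer address) tuple used as the key into a map of registered offloads. The function signature, the section name and the map layout are my assumptions for illustration, not the exact prototype code; as explained in a moment, the symbol also had to be exported first before a kprobe could attach to it.

```c
// Rough sketch of the interception point: a kprobe on the kernel's
// copy-out routine, assumed here to have a copyout(to, from, n)-style
// signature. Names and map layout are illustrative only.
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct req_key {
    __u64 pid;        /* requesting process */
    __u64 user_buf;   /* destination buffer address, registered from user space */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, struct req_key);
    __type(value, __u64);          /* running result, e.g. a sum */
} offloaded SEC(".maps");

SEC("kprobe/copyout")
int BPF_KPROBE(on_copyout, void *to, const void *from, unsigned long n)
{
    struct req_key key = {
        .pid = bpf_get_current_pid_tgid() >> 32,
        .user_buf = (__u64)to,
    };

    /* Only act on (process, buffer) pairs that registered an offload. */
    __u64 *acc = bpf_map_lookup_elem(&offloaded, &key);
    if (!acc)
        return 0;

    /* The verifier forbids dereferencing 'from' directly; the data has
     * to be pulled in through a helper, in small bounded chunks. */
    char chunk[64];
    if (n >= sizeof(chunk) &&
        bpf_probe_read_kernel(chunk, sizeof(chunk), from) == 0) {
        /* ... run the user-defined filter/reduce over 'chunk' here,
         * accumulating into *acc ... */
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```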
At this point we were able to implement a prototype. The structure we decided on offers some restrictions to the user, restrictions that aid the deployment and offloading of functions, without giving too much freedom, because otherwise it would be hard to navigate the code, and the kernel code is still a bit complex. So we tried to give the user a friendly interface where they implement a two-step filter that we called filter-reduce, sketched a bit further below. Ideally it takes as input the whole buffer, that is, the file; the filter reduces its size a little bit, and then the reduce outputs a single number, which is for example the average. Examples of operations: you filter and only keep the numbers greater than five, and then compute the average; or you do something different, only keep the numbers that are equal to 18, and then count them. And of course you can mix and match as you prefer: mean, max, count, exists, is-equal-to, you name it, any functions that result in a single numerical output.

Why did we choose this design? As I mentioned, we wanted to keep a standard interface but still give the user some flexibility, because we don't actually know what computation needs to happen for every use case. We figured this looks very much like MapReduce: you can implement a MapReduce-like function with a filter-reduce, and that pattern is heavily used in relational data processing. For example, we looked at the TPC-H benchmark and saw that its benchmark queries all perform filter and reduce operations. So this looked like a compromise between the full freedom of implementing any kind of filter you want and ease of use, only having to implement two functions against a very specific API. And it leads to the maximum data reduction, because it goes from a full buffer to a single number, which we figured is pretty convenient for solving, or at least mitigating, the data movement issue we were looking at before.

So, for one last time, we can look at what is happening in the data center now. Once again the compute node, the user, will request and instruct the storage to perform a computation. The eBPF extension will intercept the transfer while it is happening, and only the result is then returned to the user. That's pretty convenient, because this way we achieved all of the programmability benefits we saw above, so we managed to reduce network congestion, but in addition we have the extra gain of avoiding the copy of the data from kernel to user space. The user does not have access to the data in user space, so it kind of acts as a small sandbox inside the kernel. This looks very good.
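To make the filter-reduce interface concrete, here is roughly what the two user-supplied functions could look like for the "keep the values greater than five, then average them" example. The names, signatures and the (sum, count) accumulator are illustrative assumptions, not the exact prototype API; in the prototype the pair would be compiled into the eBPF program that runs at the interception point.

```c
// Illustrative filter/reduce pair for "keep values > 5, then average".
// The average is kept as a (sum, count) pair because the kernel side
// cannot do floating-point division (more on that below).
#include <bpf/bpf_helpers.h>

struct acc {
    long sum;
    long count;
};

/* filter: decide whether one parsed value from the buffer is kept */
static __always_inline int filter(long value)
{
    return value > 5;
}

/* reduce: fold one kept value into the running accumulator */
static __always_inline void reduce(struct acc *a, long value)
{
    a->sum += value;
    a->count += 1;
}
```

Swapping in a different pair gives the "count the values equal to 18", max, or exists variants without touching the surrounding interception code.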
But, once again, things didn't quite go according to plan. Well, you read the title: I had to recompile the kernel to compute the average of some numbers, so something didn't go as expected, and let me explain why.

The first problem, as I mentioned, was that I needed to know where to intercept the read path in the kernel and find the appropriate place. It turns out that copy_out was just what I was looking for, but it is not an exported symbol. That is not that big of a deal; it just means that eBPF, in the standard kernel as it ships, cannot see the symbol, the function name, so it cannot stop its execution and start right before it. I had to export the symbol first, recompile the kernel, and then it was good to go. It's not ideal, because ideally we would want to work natively without having to recompile the kernel, but for the sake of the prototype we can make do with this.

The second problem is that, as I told you, eBPF gives access to the function arguments, which in my case is the raw data being passed from the kernel to user space. But while the networking stack is very optimized for those operations, the same is not true for the I/O stack, and I couldn't use native access for that: I cannot dereference a pointer in eBPF, the verifier is going to complain. So I need to use a helper function that performs the access for me, even though it is read-only, so technically I'm not going to damage the data, and we're not going to access kernel metadata, as I mentioned. Using the helper function adds a bit of overhead, which is not ideal. It works, but having direct read-only access to the pointer would of course be more efficient, and that is not there yet.

Then we have another problem: eBPF, being formally verified and heavily sandboxed, is quite memory constrained. There is no dynamic memory allocation, so a variable buffer size is not going to serve you well in eBPF; you cannot really do that, so you will likely have to iterate in batches. If you want to do that on the stack, since that is the only memory of that kind you can use, static memory, you are limited to 512 bytes, which is not a lot for big data. You can use maps, which potentially give you much more storage, but if you iterate in batches you still need to keep in mind that eBPF has a limit on the number of instructions it can execute in a single run, which is one million. You might hit that if you perform too complex an operation, and that is another reason we wanted to keep filter and reduce as a guideline: to keep users from performing all sorts of operations on the data, which could lead to overrunning the instruction limit.

The last thing, which is the most embarrassing, is that even computing the average is not that easy. First of all you need a helper function to convert the chars you read in the buffer into integers you can then operate on. Integer operations work fine, so you can perform count, max, min and exists without a problem, but an average needs a division, and floating-point division is not supported in the kernel. So, as of now, it yields not very accurate results, although we may implement some kind of workaround for that.
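As for the division, one possible workaround, and this is my sketch of the idea rather than what the prototype necessarily ends up doing, is to accumulate only the integer sum and count in a BPF map on the kernel side and leave the single floating-point division to user space, after two integers have crossed the network instead of the whole buffer. The names and map layout below are illustrative.

```c
// User-space side of a possible workaround for the missing floating-point
// division: the eBPF program accumulates sum and count as integers in a
// map, and the one division happens here. Error handling is trimmed.
#include <linux/types.h>
#include <bpf/bpf.h>

struct acc {
    long sum;
    long count;
};

double read_average(int map_fd, __u64 key)
{
    struct acc a = { 0, 0 };
    bpf_map_lookup_elem(map_fd, &key, &a);   /* libbpf user-space call */
    return a.count ? (double)a.sum / (double)a.count : 0.0;
}
```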
But despite the problems and hiccups along the way, the prototype I presented does not really do anything conceptually different from the networking stack: it performs reads, reductions and drops on the I/O side, the way BPF already does on packets, so conceptually it is something BPF already has the power to do. It achieves a data movement reduction, because as we saw we go from a full buffer of data to just one number, and it is executed in a safe, isolated environment, eBPF, which is already available in data centers right now. But indeed, in order to be production-ready, like many networking products for eBPF already are, it needs more support in the I/O stack. So there is some work to do, but we do see the potential in this, and we actually think that programmability is where eBPF is headed.

There has been so much development in eBPF lately, and it is used all across the board. For tracing there are products that are very solid and well developed, like BCC and bpftrace. In networking it is used by Cilium, and Google is now starting to use eBPF for its data plane programmability. For security you can instrument eBPF as well: there is the Kernel Runtime Security Instrumentation, from Google again, and Falco. So it is expanding, it's not networking-only anymore, and programmability might actually be the next step. In fact, we believe that is the case, and it turns out that this year, these are all articles and headlines from 2020, eBPF is being looked at as a way to change how Linux kernel programming happens, and potentially a way to turn the kernel into a programmable Linux kernel. We think this is the direction things are going; we are going to work on it, and we hope you will join us as well. Thank you for listening, and I'd be very happy to hear your questions now. Thanks a lot.