unless you really need to. So there's a lot of lore there, and we try to break it: yes, new kernels are good, we test them, we validate them. The next reason is that some machines are really easy to reboot, but some machines are really hard to reboot. If you have a database server, even with a master/slave setup, failing over the master and putting the slave in its place is a hard process that takes a while and is disruptive to the infrastructure. If you have file storage, you can create a lot of disruption, and sometimes the machine doesn't come back. So you need to be ready to lose any machine in your fleet at any time, and to be able to re-image any machine at any time if, for some reason, we didn't test the kernel on that hardware and the machine doesn't come back. That's something you need to make sure you have in your fleet: don't configure machines by hand. Make sure you have something like Puppet, Chef, or Salt that does the configuration, so the provisioning is automated. That's something you need to put in place to be ready.

On the first question: yes, we use Ksplice. We are looking into kexec to accelerate reboots, especially on the development side, but we still see a lot of issues with kexec, and that's something we'll keep working on. There's a lot of hardware that doesn't initialize properly after a kexec, so part of our fleet works well, but on some machines it just doesn't work; there's still a lot of work to make that happen. And with Ksplice you're still limited in what you can do. It's getting there, but you can only make specific kinds of changes, so we prefer to keep it for really specific security fixes that have to go out fast. And you still need to reboot into a real kernel at some point to have a good base.

So the first part of convincing people is having really good communication in place: you need to be able to give out the plan. That's the conversation we had for a while. People would say, "I finished upgrading," and we'd say, "Here's a 3.10 kernel," and as soon as they finished, "Actually, now you need to upgrade to 4.0." That all happened within about a six-month period for some service owners. And you need to listen to the concerns. A lot of people have a lot of fear about kernel upgrades, a lot of unknowns. So listen to what they're afraid of, and make sure you integrate that into your testing, or fix whatever is actually broken in the kernel.

I talked about the release process earlier: just make sure you have a good, regular process, so people know when they're going to get a new kernel. Our current release cadence is a new kernel out every six to eight weeks. We follow the upstream way of doing things quite closely: we rebase every year, and we release an internal kernel every six to eight weeks, so people know when they're going to get a bug fix and when it's going to be deployed.

Building: software development 101, you need an automated build process. We currently use Jenkins; we're going to use something more internal at some point, but the point is to have one script that builds all the kernels, outputs some RPMs, puts them in a repo, and has them ready to be deployed.
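As a rough illustration of that "one script" idea, here is a minimal sketch; this is not Facebook's tooling, the paths and repo layout are hypothetical, and it assumes the upstream kernel's `make rpm-pkg` target:

```python
#!/usr/bin/env python3
"""Sketch of a one-script kernel build: source tree -> RPMs -> repo.

Paths and repo layout are hypothetical. `make rpm-pkg` is the upstream
kernel target that produces RPM packages via rpmbuild.
"""
import glob
import os
import shutil
import subprocess

KERNEL_TREE = "/build/linux"      # assumption: kernel source checkout
RPM_REPO = "/srv/repo/kernels"    # assumption: repo dir served to the fleet

def build_and_publish():
    # Build the kernel and package it as RPMs.
    subprocess.run(["make", "-C", KERNEL_TREE, f"-j{os.cpu_count()}",
                    "rpm-pkg"], check=True)
    # rpmbuild drops packages under ~/rpmbuild/RPMS by default.
    for rpm in glob.glob(os.path.expanduser("~/rpmbuild/RPMS/*/kernel*.rpm")):
        shutil.copy(rpm, RPM_REPO)
    # Refresh repo metadata so hosts can pick up the new build.
    subprocess.run(["createrepo_c", RPM_REPO], check=True)

if __name__ == "__main__":
    build_and_publish()
```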
The testing part is, I would say, the thing that is still lacking, both internally and at the community level. For kernel testing there are a couple of efforts trying to improve things, but there's no really good way yet. We currently run some tests; we run LTP on each build. I'm looking to integrate kselftest, but it's hard to know where to put your tests, there are so many places. That's something where we'd like to see improvement, and maybe we'll contribute something at some point when we get to a good level. And if you have tests, publish them at some point so other people can use them.

Another thing you can do is comparison testing, which is what we do: we compare the same workload on the new kernel and the old kernel, and we have many machines, so we test on different hardware. That's a lot of testing, but it's what you need to make sure the kernel is good; people will trust you when you give them good kernels.

The next part sits between testing and deployment. Doing some kind of canarying really helps people gain confidence in your process. You can set up shadow traffic, which is what we have, sending the same traffic to two sets of machines, the production set and the candidate set, or just deploy to a small subset and let it bake for a while. People gain confidence from that. The other thing we do is build daily kernels: daily builds of our own branch, but also of the tip of the upstream tree. We deploy those to some machines so we can catch upstream problems faster and fix them before they get into a release kernel, and we don't have to go through the whole upstream backport cycle. We want to do more of that, more testing on the actual upstream kernel, and maybe publish pre-release performance numbers if we can; that would be a good thing, so regressions become visible.

The question is: is this automated? I'd say it's a bit of both. We have some basic automation that will deploy, run, and analyze numbers, and there's a service that can run two setups, compare the numbers, and give us a good/bad result. But currently a lot of it is just: let it run and see if anything breaks. That also gives you a good signal, especially if you can spare some machines; if you have a few web servers to spare, run the new kernel there and see what happens.

And the last part: having a tool to deploy your kernel will help you a lot. I don't think there's a good open-source tool to deploy a kernel to a fleet at this moment. You can maybe use some kind of configuration management to deploy it, or develop your own tool, but you need some way to deploy a kernel slowly, making sure you don't take a whole service down: if a service runs on 20 machines, you don't want all 20 machines down at once. Internally we rely on a tool called FBAR, Facebook Auto-Remediation, which basically does all the operational management: draining the machine, putting it back into production, doing the actual operation.
So that's a good thing to have if you want to do many machines at a time. We've done this a couple of times now: we did the switch to 3.10, and we did the switch to 4.0. Here are a couple of gotchas that burned us, so you should probably be aware of them if you want to track upstream and deploy to a fleet.

Don't rely on version numbers too much, because they can change fast. We were hoping for 3.20, because a lot of our monitoring had hard-coded "3.something" kernel versions, but when the switch to 4.0 came, we broke many things. So if you develop an open-source tool that relies on the kernel version, make sure you can support future major numbers. That was a little bit annoying.
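This kind of breakage is cheap to avoid. Here's a small sketch (a hypothetical helper, not their monitoring code) of parsing a kernel release string without baking in the major number:

```python
import re

def parse_kernel_release(release: str):
    """Parse a 'uname -r' string without assuming the major version.

    Accepts things like '3.10.0-123.el7.x86_64' or '4.0.9-fb1'.
    """
    m = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release}")
    major, minor, patch = (int(g) if g else 0 for g in m.groups())
    return (major, minor, patch)

# Compare as tuples instead of matching a hard-coded '3.' prefix:
assert parse_kernel_release("4.0.9-fb1") >= parse_kernel_release("3.10.0")
```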
Also, when you deploy a new kernel, you're going to need to upgrade some packages: ethtool, iproute, SystemTap, perf, crash, numactl, all the packages that depend on the kernel, if you want to exploit the kernel's new features. If you don't use the standard GCC on your system and you deploy external modules, make sure the GCC versions match between building the module and building the kernel; if they don't, the ABI might not match and you'll have a lot of problems and cascading damage in your fleet.

Sadly, kernel configuration management is a really difficult thing. Even if you version-control the .config file, there's no way to add comments or to track why you made a change. We had problems like: we enable something in one version, then we copy the .config to a new branch and forget that we enabled some specific hardware option, and then we deploy the kernel and somebody asks, "Why is this not working?", and you find out you disabled something you had enabled, or vice versa. I wish we could do better; if somebody has ideas to improve the Kconfig subsystem, to be more expressive about what you want, about dependencies, and about why changes were made, that would be a good thing to have.

A lot of the performance regressions we hit were because somebody, and this is a recurring pattern, added a lock somewhere to protect some data structure. One example: epoll_ctl got really slow for one use case. We were only using epoll_ctl a little, but that call got really slow. We found the problem; it was just a new lock. We were able to rework the whole area to do a lot less work and reduce the contention, but that's something to watch for: if you see something getting really slow, that's probably what happened.

If you still have some patches around, make sure you don't forget them. We've had patches in the past that nobody knew we had. For example, with perf, people said, "Oh, it stopped working." There was a patch there that I didn't know about; it was never sent upstream and wasn't documented anywhere. That's one of the major reasons upstream-first is better: you don't have to deal with that problem. And if you upgrade drivers, they sometimes come with firmware, and you'll probably forget about it, so you need to upgrade those binary firmware blobs too. That's also a deployment difficulty; usually upgrading to the latest linux-firmware package is enough.

But many problems that get reported to us are actually not kernel problems. A lot of people see that there's a new kernel and they see a problem, so: problem plus new kernel equals kernel problem. Make sure people understand their workload: some applications just use too much memory and get OOM-killed, and people blame the kernel, but sometimes it's just the application using too much memory. So make sure you have good monitoring in place for the kernel itself, which we do. We use netconsole to collect all the kernel message output; you can use syslog if you want, it does the same thing. We use kdump to collect all the crashes in the fleet. We're having fewer and fewer of them as we move to new major kernels; you can really see the number going down, but we still have some, and it's really good to have the full kdump, the full core, because it makes the problem much easier to debug. We also do performance monitoring, using ftrace- and perf-based automation to collect data, and we're going to use BPF-based data collection to continuously monitor things like syscall performance, to make sure we don't regress over time.

So, a couple of pieces of closing advice. Going upstream first is good; it saves you time in the long term. It's more work initially, but it will save time and make everything better for you and for the community. Don't be afraid to push back internally. When we started doing this, a lot of people asked, "Is it really a good idea? It will slow us down, and at Facebook we're all about moving fast." But in the long run it enables us to move fast, by making rebases easy. You need to be really convincing and make sure people are doing the right thing, but we're now starting to see other teams do this: the HBase team is doing it now, and some other teams are looking at making really good contributions. In principle: stop forking internal versions of projects, and make sure the upstream project is the good way to go.

As I said, we have fewer than 20 patches now, and I guess the chart I like best is the age of our kernel, the average age. When we started a few years ago, we had just deployed a new kernel, but there were still a lot of things aging. As we've gone on with the upstream-first approach and moved to better kernels, the age, measured from when the corresponding major upstream release came out, keeps going down. Hopefully we'll get younger and younger kernels in the fleet, and that will be better for us.

So make sure you send your patches upstream. Don't keep patches internal, that's bad; please, everybody wins. If this is the kind of challenge that interests you, I invite you to come see us at the booth. We have many other challenges like that to deal with at Facebook, so it's a really fun thing to do. If you have more questions, I think there's a lot of time left. Yes.

So the question is: when you hit a crash, how do you verify whether it's already fixed upstream, or whether there's an existing bug report? That's a good question. We don't have a solution for that right now; you need to just know. We have an internal system tied to our bug tracker that compares call stacks, checks whether it's the same crash, and then associates a task in the internal bug tracker with it. For upstream, we don't have anything. I don't think there's a good way; there's a kernel bugzilla we could look at, but it's hard to see if it's the same problem.
So you basically need to look at the kernel history and see if there's a patch that looks like it, and that's something that has happened to us many times. You see a crash, a panic, and you think, "I'm going to fix this bug," and then you look at the upstream tree and somebody has already written the same patch. That's one thing that's good about running newer and newer kernels: if you run the latest upstream, there's a much smaller chance your bug hasn't already been fixed. So that's my advice, just run upstream. But there's no good way to check; that's been a problem for a long time. How do you track kernel bugs in the community? There have been many efforts, but there's no single tool everyone uses. There's the kernel bugzilla, but not everybody uses it; some people just post to the mailing list, so it's hard to track all these things.

The question is how many people are on the team. The kernel team by itself is close to 20 people, but it's an effort everybody takes part in: every service owner actually looks at their bugs; when they see their systems crashing, they help with the bug, and we'll be there as helpers, help them debug, or debug it for them. There are many people looking at this internally. Yes.

So the question is how many people actually work on the kernel, and who works on it: is it a kernel team, or is it spread out? Most of the people who contribute are part of the kernel team, and most people have a dedicated area: file systems, networking, performance, that kind of thing. There are some people outside the team who contribute a patch once in a while. Some service owner will say, "I have a problem with a specific driver," or, I see this more often with perf, "I want to add this feature," and they'll contribute it back. They often come to us for a first review, since the kernel team is more used to sending things upstream, and we'll help them send it upstream, but we actually want them to send it as the real author; it's easier for them to get the recognition that way than for us to just take the patch. Some people say, "Can you deal with upstream? I don't want to deal with them," and then we'll take their patch and shepherd it through the whole upstream process. But it's mostly the kernel team, plus a few other teams where people might contribute. That's the way it's organized.

No, it's about 20 people. A lot of the workflow is purely upstream work. I'd say Chris Mason, for example, contributes maybe hundreds of patches just on Btrfs, so that's a lot of effort. A lot of the contributions don't even touch the Facebook data center: we do testing, we use the Facebook infrastructure for testing, but we just send the patches upstream, and they come back at some point next year when we do the rebase. Yes?

So the question is: what do you need to submit with performance patches in particular; what proof?
For performance patches especially, people will want to see numbers, they want to see the workload, and in particular they want to see the code that generated the workload. Sometimes you cannot release all of that, but most of the time, if you can include a small C example that goes with your patch, or a link to one, or a benchmark program, and show the numbers, people will take it. If you look at the history, these patches always have before-and-after numbers. People won't trust you just on the claim that "it's better"; nobody accepts that as the value. You want to show why it's better. One thing that's really hard, though, is comparing hardware: which hardware was this measured on? The good thing about the OCP project is that we have standard hardware with all the specs published, so we can use that as a benchmark reference, like the "Leopard" type 1; all the specs are there, so it's easy to compare between versions. At some point we might be able to just say "on this hardware" and people will know. But yes, there are multiple layers, so you need to describe the infrastructure as much as you can.

Any other questions? So the question is: what's the timeline between getting a bug and the fix landing upstream? That really depends. A typical bug might take a week or two to pinpoint, to find where it is, and then it depends what it is, but if it's an easy fix, we'll send it upstream, it will get applied right away, and it goes into the next rebase. And as soon as we know the subsystem has accepted the patch, we backport it directly; we won't necessarily wait for the next release. As soon as we know the patch is good, we take it. Some bugs take a couple of weeks to fix; some might take six months because they're hard and complex. There's no typical case. We're out of time, so I'll be around for questions; come see me.

[Break between talks; mic checks and inaudible conversation.]

Okay, I guess we can start. The topic of this presentation is how we are speeding up software like ps and top; it's a very long story how we came across this and what we're doing. Has anyone in the audience noticed that ps is slow? Me personally, I had complaints like "ps is not working," and then I logged into the system and took a look at it under strace, and I saw that it was working, just slowly, getting through all the files. But from the perspective of the user it was not working, because ps gathers all the information first, then sorts it and so on, and there's no output for quite a long time while it does that.

So, the agenda: I'll give you a short introduction to the company I work for, the projects I'm involved with, and myself; then we'll look at the current interface for accessing information about processes, and its limitations; then I'll show a similar problem, actually one problem, that was solved before; and then I present the
solutions, both bad and good, and we'll see the performance of the good solution we're coming up with.

Marketing forced me to put in this slide. I work for Virtuozzo, a company that's been around for a long time. We're an industry pioneer: we did containers before they were cool, before they were even called containers. We have lots of partners and lots of workloads running; some of those partners, I believe, are actually present at this conference. The company was founded back in 1997, and we're in the process of separating from the parent company. The headquarters is in Seattle, with offices in London, Moscow, and Munich; it's about 117 employees, more than 100 of whom are actual engineers, and we have 15 kernel hackers on board. Not that kernel hackers are not engineers; they're actually like the best engineers. And we sponsor, contribute to, or are otherwise involved in some of the key open source initiatives. Virtuozzo is the same company that used to be known as SWsoft, Parallels, and Odin; it's the very same company, just that sequence of names.

About myself: this is the first time I'm including a slide about myself. I've been a Linux user since 1995; back in the day it was Slackware, very old kernels, very funny hardware and software. I've been involved in the development of containers since about 2002; actually they were called virtual environments then, as opposed to virtual machines, and the term "container" came a bit later. I'm the principal author of utilities such as vzctl and vzpkg; the latter, I believe, is sort of a precursor of Docker. I led the whole OpenVZ project from its very beginning in 2005 up to last year, and then I moved on to some research work. I also happen to be a long-time SCALE speaker; my first time here was exactly 10 years ago, doing an introductory talk about OpenVZ. That was actually my first talk, and my first time here, because I was living in Russia at the time. Anyone here who attended SCALE 4x? Wow, so there are two of you; two of us, counting me. My Twitter account is my family name, Kolyshkin, and that's all about me, I guess.

So, OpenVZ is a full container solution for Linux. Unlike Docker, we do full system containers: we run a whole distro inside, and each container looks more like a VM, but it's a container. It's been in development since the last century, and it was open-sourced in 2005; that's how the OpenVZ project was born. We've had live migration for containers since 2007. The problem at the time, and it still is, was that back then there was no container functionality in the kernel at all, so we had to patch the kernel heavily, a lot. We found ourselves with lots of kernel patches, and we felt we'd better merge those patches into upstream Linux, so everyone else could use them and we could avoid the work of porting them to each newer kernel. It ended up that we have, not just submitted, but merged over 2000 patches into the Linux kernel, which makes us the biggest contributor to the container functionality in the kernel. That's the stuff that enables LXC, Docker, CoreOS, and of course OpenVZ too. OpenVZ is now being reborn in the form of Virtuozzo 7, and it's becoming yet more open, less proprietary. Virtuozzo 7 is currently in beta, and you're all welcome to give it a try; I think beta 3 is about to be released.

CRIU is another project I'm involved with, and this talk is mostly about some small
work that we've done with CRIU. CRIU was born as a subproject of OpenVZ, in order to replace OpenVZ's in-kernel checkpoint/restore mechanism; I'll tell you about that later. It's a bit more than three years old. The whole point of CRIU is to be able to save and restore sets of running processes on a Linux system. It's like hibernation, suspend-to-disk, but not for the whole system, just for a set of processes; and that set of processes might happen to be a container. So this is what CRIU does: it saves the complete state of the running processes, and later you can restore them, possibly on a different machine, which is what's called live migration. CRIU is currently integrated into OpenVZ, of course, and into Docker and LXC. If some of you have seen the Docker demo where they play Quake and migrate the Quake server from Europe to the US, that's all done with CRIU.

Being able to checkpoint and restore sets of running processes is a prerequisite for live migration, but the focus of CRIU is a bit wider than that. For example, you can periodically save the state of a very long-running computational process. Say you have something that needs two weeks to compute, there's a power failure in between, and you lose a week of work. Instead of modifying the application, you can use CRIU to periodically checkpoint its state, say once every hour, and if something goes wrong, you restore from the checkpoint instead of losing a week of work. Same thing for games: you can have that magical save button in a game that lacks one. You can do things like updating the kernel, or doing something with the hardware that requires a reboot or a power-off: instead of stopping everything, you checkpoint everything, reboot into the new kernel, and then, instead of starting everything up, you restore it. That way it's faster, and if you do it within a few minutes, all the network connections remain, so your users see it not as downtime but as an unusually long delay. With live migration you can of course do load balancing within a cluster. You can speed up application startup: for example, we did a test with the Eclipse GUI, which took about a minute and a half to start; we checkpointed it, and instead of starting it we restore it, which takes about five seconds instead of a minute and a half. I believe some companies are trying to do this with Android phones. You can do things like reverse debugging, going back in time: you keep a checkpoint, and you can always return to it and test from there, not from the very beginning of the application. And there's one more feature we have and use in CRIU: you can inject faults into an application. For example, you can close an open file descriptor of any application and check that it handles that correctly, stuff like that.

The main story behind CRIU is that we had this task of merging all the OpenVZ kernel code into upstream Linux, and one part of it, I think about one third of the code, was checkpoint/restore for containers. We tried hard to merge it, but that code is spread across the whole kernel, everywhere except maybe the drivers; it's very invasive, and no subsystem maintainer wanted to see our code in their subsystem. And we were not the only ones who tried to implement checkpoint/restore in the Linux kernel and merge it upstream;
there was a guy who spent a few years of his life trying to do the same, and he failed as miserably as we did. So we decided to go around the problem and reimplement the whole thing in user space, or mostly in user space; this is how CRIU was born.

The idea is: for checkpointing, you need the state of the running applications, and there are a lot of existing mechanisms to gather that state from the kernel. There's the whole of /proc, with bits and pieces of information about the running processes; there's the ptrace mechanism used by debuggers; and there's the netlink socket, which you can use to get information about networking. Then there's an interesting technique called parasite code injection, which lets you insert your own code into a running process and run it as if you were that process; some bits of information about a process you can only get by being that process, which is why we have parasite code injection and why we use it in CRIU. Of course, not all the information is there, and where it isn't, we have to add functionality to the kernel to provide the extra information we need to get the complete picture of what's running. So far we have achieved that with about 170 kernel patches in total, which is pretty small, I guess. As of kernel 3.11, the kernel is sufficient to run CRIU; it has everything, like 99% of it. There were some corner cases that we attacked later, so there might be some more CRIU patches coming into the kernel, but most of it is in 3.11, provided the CONFIG_CHECKPOINT_RESTORE option is set. That config option exists because upstream kernel developers were not quite convinced it was possible to do checkpoint/restore in user space, but they let us try, and they said: everything you put in should be under this define, and if you fail, we just remove all this code together with all the defines. Fortunately we succeeded, and most distros now set the option by default.

So, to the topic of the talk. The current interface for getting information about processes is mostly the /proc interface, where each process has a directory whose name is the process ID, and in that directory you have about 40 different files, each telling you something about the process, plus about 10 subdirectories with more stuff under them; so it's more than 45 files, maybe 60 or 70. This thing has been there since the very beginning, I believe, and it works for everyone, but it has some limitations. First of all, as we found out by profiling CRIU, it takes a lot of time to read /proc, because for every small file in there you need at least three system calls: open, read, and close. And you repeat that ad infinitum, because there are so many files and so many processes. This is the same thing ps is doing. That's a lot of context switches, a lot of syscalls.
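To make the cost concrete, here's a minimal sketch (my illustration, not ps or CRIU code) of the per-file open/read/close pattern being described; every iteration pays at least three syscalls:

```python
import os

def read_all_statuses():
    """Read /proc/<pid>/status for every process: 3+ syscalls per file."""
    results = {}
    for entry in os.listdir("/proc"):          # one pass over /proc
        if not entry.isdigit():
            continue                           # skip non-process entries
        try:
            # Each of these costs open(2) + read(2)(s) + close(2).
            with open(f"/proc/{entry}/status") as f:
                results[int(entry)] = f.read()
        except FileNotFoundError:
            pass                               # process exited meanwhile
    return results

print(len(read_all_statuses()), "processes scanned")
```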
The next problem is that pretty much every file in /proc has its own unique format. Some files are presented as tables with a header, some as tables without a header, some are just sequences of numbers and strings, some are "name: value" pairs, and so on. Basically you have to write your own parser for each of those files, and that's what we're doing.

A slightly smaller problem is that the format is text-based. The kernel has it all in binary form, then prints it out for us as text; then we translate it back into binary, for example when reading numbers and UIDs and such; and then we translate it back to text for printing. That's a lot of translation back and forth. Ideally we would get the binary data straight from the kernel, but the upside is that text lets you read the files from /proc using just cat and see what's going on.

The third problem is that there's not enough information in there. For example, if you take /proc/PID/fd, it shows you the open file descriptors and the files they're associated with; you see that 0 is stdin, and so on. The problem is that, due to over-mounts, those file names can become irrelevant, and there's no way to figure out what the position in the file is, what the file open flags are, and so on. That one was solved by adding /proc/PID/fdinfo, where every file descriptor gets that additional information. But there's much more to it; it's just one example of the kernel not providing enough information.

A very big problem is that some of these /proc file formats are not extendable. Here's the example: these are the mappings of the current process, the regions of memory that are mapped. It has the addresses, the protection bits, some other information, and the file name, like the library that's mapped into this cat process. The problem is the last field: the file name is optional, because there are anonymous mappings. And if the last field is optional, you cannot add any more information after it. Now we need VM flags in here, and we can't add them, because this format is basically set in stone; you cannot change it without breaking all the backward compatibility. Fortunately, we do have VM flags in a different file, called smaps; not maps, but smaps, which has the information we need, these VM flags. Unfortunately, smaps also has statistics, like how much memory is used, which we don't need; we discard them. And it takes a lot of time to gather this information that we immediately throw away.

So this is the next problem: sometimes files in /proc are slow because of things like that. Let me show you an example. Take all the /proc/*/maps files and read them all: that's pretty fast, around 0.02 seconds, not much. Now let's do the same with smaps: you see, it's a tenth of a second, maybe even more; that's something like five times slower, and this is with only about 200 processes. On a system running thousands of containers, it would be way over a second to read it. And that's for information we don't need: it's okay if we need this information; it's not okay if we immediately discard it, first asking the kernel to produce it and then throwing it away. So the problem is that sometimes files in /proc are slow because of extra attributes we don't always need. That's the same explanation I just showed you.
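Here's a small sketch (my own, not CRIU's parser) of why that optional trailing pathname pins the format of /proc/PID/maps: a parser has to treat everything after the inode column as "the pathname", so no new field can ever be appended:

```python
def parse_maps_line(line: str):
    """Parse one /proc/<pid>/maps line.

    Format: address perms offset dev inode [pathname]
    The pathname is optional (anonymous mappings omit it), so it must stay
    the last field; nothing new can be appended after it.
    """
    parts = line.rstrip("\n").split(None, 5)   # at most 6 fields
    addr, perms, offset, dev, inode = parts[:5]
    pathname = parts[5] if len(parts) == 6 else None
    start, end = (int(x, 16) for x in addr.split("-"))
    return {"start": start, "end": end, "perms": perms,
            "offset": int(offset, 16), "dev": dev,
            "inode": int(inode), "path": pathname}

with open("/proc/self/maps") as f:
    for line in f:
        print(parse_maps_line(line))
```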
We had a similar problem in CRIU when dealing with sockets, with networking. The usual place to get information about sockets is /proc/net: there's /proc/net/netlink, /proc/net/unix, tcp, packet, and so on. And of course, each of those files has its own unique format; some of them try to look like tables, some like lists, there are tons of fields, and they all look ugly, but still human-readable. It's basically the same set of problems I just described: there might not be enough information in there, the format is complex, and it's an all-or-nothing approach; you either request all the information or none at all.

Fortunately, there was already a solution in the kernel: the so-called netlink socket. It was used to get information about TCP sockets, and it was called tcp_diag at some point; then, I think, it was called inet_diag after a rename. We generalized it and added all the other types of sockets, and now it's known as sock_diag. We don't need those /proc/net files anymore; we just ask the kernel for what we need through that netlink socket. There are two very good things about the netlink socket. First, the format is binary and extendable; it is designed that way, so you can add new fields without breaking backward compatibility. Second, you can specify explicitly what kind of information you need, and about what. You don't get all-or-nothing; you don't get extra bits you'll just throw away, only what you need.
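For a taste of sock_diag, here's a rough sketch that counts UNIX-domain sockets by sending one binary netlink request and asking only for the attribute it needs. It assumes a Linux kernel with CONFIG_UNIX_DIAG; the constants mirror the uapi headers:

```python
import socket
import struct

NETLINK_SOCK_DIAG = 4
SOCK_DIAG_BY_FAMILY = 20
NLM_F_REQUEST, NLM_F_DUMP = 0x1, 0x300
NLMSG_ERROR, NLMSG_DONE = 2, 3
UDIAG_SHOW_NAME = 0x1            # request only the socket-name attribute

# nlmsghdr (16 bytes) + struct unix_diag_req (24 bytes) = 40 bytes total
req = struct.pack(
    "=IHHIIBBxxIIIII",
    40, SOCK_DIAG_BY_FAMILY, NLM_F_REQUEST | NLM_F_DUMP, 1, 0,  # nlmsghdr
    socket.AF_UNIX, 0,           # sdiag_family, sdiag_protocol
    0xFFFFFFFF,                  # udiag_states: sockets in any state
    0,                           # udiag_ino: no inode filter
    UDIAG_SHOW_NAME,             # udiag_show: exactly what we asked for
    0, 0)                        # udiag_cookie

with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_SOCK_DIAG) as nl:
    nl.sendto(req, (0, 0))       # (pid 0, groups 0) addresses the kernel
    count, done = 0, False
    while not done:
        data = nl.recv(65536)
        off = 0
        while off < len(data):
            length, mtype = struct.unpack_from("=IH", data, off)
            if mtype in (NLMSG_DONE, NLMSG_ERROR):
                done = True
                break
            count += 1                 # one unix_diag_msg (+ attrs) per socket
            off += (length + 3) & ~3   # netlink messages are 4-byte aligned

print(count, "unix sockets")
```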
So we looked at that and thought: why don't we add tasks to the same interface? And that turned out to be a bad solution to the problem. Basically, we extended the netlink socket with something called task_diag, and we asked for information about a set of processes through the netlink socket: we want to know this and this about this set of processes, and it gives it back. The first problem is that the netlink socket is designed around networking. It knows everything about network namespaces, but it doesn't know anything about the PID namespace or the user namespace or the other namespaces, and it's not clear how to amend it for all that; you'd need to extend the format or do something like that, and it's probably not a good thing, because it's a netlink socket, it's about networking. The other problem is that the netlink socket has a somewhat weird security mechanism: when you create the socket, the kernel saves your credentials, and later it uses those saved credentials to decide whether you're allowed to do this or that. But a process can drop privileges later: you can start as root, open the socket, then drop root, and that's invisible to netlink; it saved those credentials and still thinks you're root. And finally, it's one interface used both to get and to set networking information, so if we use the same interface to get process attributes, it could potentially be abused, for example to add IP addresses to interfaces; there's no mechanism to restrict that. That's what makes netlink bad for solving our /proc problem. There's yet another example of abusing netlink for this kind of thing, and it's called taskstats: statistics about running tasks, not available from /proc but available through the netlink socket. I think it needs to go away, but it's still there, and we didn't want to add yet another bad thing to the kernel.

Instead, we are proposing this interface: a file called /proc/task-diag. It's a transactional file: you write a request to it and read the response back from it; if there are two users, each user gets their own reply. It uses the same good netlink message format, binary and extendable, and it's good precisely because of those two characteristics. It lets you get information about a specified set of processes, more on that later. And we made sure to group the attributes. An attribute is a single value, like the PID or the UID or the program name, something like that, and we group those attributes into groups; the interface only lets you request whole groups. You specify "I want this group and this group and this group," and it gives you all the attributes in those groups. The principle is: if you add an attribute to a group and it makes the group slower, you're doing it wrong; it needs to go into a separate group. Everything that slows things down should go into its own group, and we adhere to that principle. Finally, netlink messages are limited to 16 kilobytes, but you can have many packets, so whatever exceeds 16 kilobytes the kernel just splits for you.

This is what we have right now, and we think it solves the problem we came across; it also solves the problem of slow ps, and some more. But it's work in progress. The current status is that we're about to send the patches to the kernel; we already sent a few iterations of patches using the netlink socket, the bad solution we discussed, and having understood it's a bad approach, we're now slowly going with this one. We're about to send the patches, so it's not there yet, and because of that, it might change.

This slide, I'm not sure I really want to show it: it describes the format of the netlink message. Basically it says it's easy to add new attributes, easy to add new groups; the format is completely extendable, it's binary, and it's pretty simple, very easy to parse.

So, in our newly proposed interface, what are the ways to specify which processes you want information about? First, you can dump all: get every process in the system; that's what ps would use by default, or top. Then you can dump all threads, which means all the processes plus all their threads; the distinction is that "all processes" means all the thread-group leaders, while "all threads" includes all the other threads as well. You can ask to dump the children of a specific process, which some other program might use, like "get me all my children"; CRIU would use that to get all the processes in a container, starting from its init. Then you can ask to dump the threads of one process, which doesn't include children, only threads. And finally, one specific process: "I want information about this one."

And here are the groups of attributes. The base group includes all the different IDs and the command name; that's roughly the same as /proc/PID/status. Then you can ask for credentials: all the primary and secondary UIDs and GIDs, and the capabilities. Then you can ask for stats about the processes; this is the same as taskstats, which I mentioned as another example of bad netlink usage. That information is not available in /proc, and we very much hope taskstats can be replaced by this. Then there's the VMA group, which is the equivalent of /proc/PID/maps, and VMA stat, all that extra information that takes long to gather, which we don't need, but someone might.
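Since the patches are unmerged, there's no stable ABI to show, but the shape of a transaction would be roughly this. Everything here, the constants, the request layout, even the file name, is hypothetical and for illustration only:

```python
import struct

# Illustration only: the task-diag patches are not merged, so the constants,
# struct layout, and file name below are hypothetical and may change.
TASK_DIAG_SHOW_BASE = 0x1        # hypothetical: IDs + command-name group
TASK_DIAG_DUMP_ALL = 2           # hypothetical: every process in the system

# Hypothetical request: { u64 show_flags; u64 dump_strategy; u32 pid; }
req = struct.pack("=QQI", TASK_DIAG_SHOW_BASE, TASK_DIAG_DUMP_ALL, 0)

with open("/proc/task-diag", "r+b", buffering=0) as f:
    f.write(req)                 # the write is the request...
    total = 0
    while True:
        chunk = f.read(16384)    # ...the reads return netlink-framed replies
        if not chunk:
            break
        total += len(chunk)

print(total, "bytes of reply, parsed like any netlink stream")
```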
Now, a performance comparison. I have a VM running with this very kernel, the patches we're going to send; it's 4.0 plus our patches. Let me see what we have here; we have about 10,000 processes running. What I'm about to run now is a test called proc_all: it just opens /proc/PID/status for every process, reads it, and closes it. It doesn't even parse it. I think there's no output; yes, it just prints the total number of entries it read. You can see it takes about 0.05 seconds; this is the current interface, /proc. Now let's do the same thing with diag, using the new interface: it's at least five times faster at getting the same information, and it's also easier, although I don't think the parsing takes much of the time; the win is that instead of 30,000 system calls, we do just a few. So this is it, at least five times faster; this is the newly proposed interface.

One other performance metric I'm going to show you comes from the perf project. The guys writing the perf tool had similar problems with /proc, and when they found out we were doing something about it, they asked us to share the patches, and they tested them. This is an email from David Ahern, who did some testing; in his testing, the first test is about five times faster and the second test is about ten times faster. So that's what he found; call it independent performance testing.

And that's about it. As I said, the current status is that we're about to send this upstream, and it's not the first iteration; it's actually the third, because the first two were netlink-based. This is just a little bit of the work we're doing with CRIU; there's much more of it. I think I could do a week of talking about such things, but this is the last slide. I wanted to give you a link to the current code, and I don't have a slide for it anymore, but generally: CRIU is available from criu.org, OpenVZ is available from openvz.org, and this work is in my colleague's GitHub account, avagin, in his Linux tree with the task-diag work; and you'll see it on the Linux kernel mailing list pretty soon. That concludes my talk; if there are any questions, I'll be more than happy to answer. And I think I do have some CRIU and OpenVZ stickers, if your notebooks can accommodate one or two more. Thank you.

Question time; any generic questions about OpenVZ or containers are also fine. So, basically, it doesn't go to Linus Torvalds himself. Usually this stuff isn't even discussed on the main Linux kernel mailing list, but on one of the subsystem mailing lists, although this one might be a subject for memory management; there's no proc mailing list, so this was actually not for the generic Linux kernel list. Usually it's discussed on the subsystem mailing list first. Every subsystem maintainer has his own tree, like the linux-scsi tree or the linux-net tree. With the maintainers it's discussed, it's agreed upon, it gets reiterated over and over, you polish your stuff; they're pretty sensible when it comes to APIs. Once it's all agreed upon and polished, it gets merged into the maintainer's tree, and for the parts that don't have a maintainer, it goes to Andrew Morton, who has his own tree and maintains everything that's not otherwise maintained. And these big guys, the subsystem maintainers and the lieutenants, send the stuff to Linus, and Linus very rarely complains; he complains if something arrives right before the merge window closes, but otherwise he just trusts the maintainers. That tree of trust has to be there; otherwise Linus would just die under the pressure.

No, it's in no kernel yet, because we're about to send it. We sent the earlier versions over a netlink socket, and in the end we decided that was a bad approach and this is the good one; we just need to split the patches, provide a nice readme and so on. We already have tests for it, and we already have other people,
like the perf team, who are about to support us with this. Overall, I think that in its current form it will make its way into the Linux kernel within a year or so. Yes, so it would be like 4-point-something, I don't know, something like that. Alright, thank you so much.

Should this be live? Our next speaker is Brenden Blanco, with in-kernel low-latency tracing and networking.

Hello everyone, I'm Brenden, and I'm here representing the IO Visor project, which is a new Linux Foundation collaborative project; it's about six months old. This is my first time at this conference, and it's good to be here. What I'm going to talk about today: first, a little about the history and motivation of the project; then I'll introduce you to something called eBPF, which is a feature in the Linux kernel; I'll talk about the BCC toolkit, which is one component of the project; I'll also show how Clang and LLVM are used by BCC in interesting ways; and we'll do a few demos for networking and tracing, and leave time for questions. While we're going through this, even though it's a technical talk with a lot of content, if you have any questions, do feel free to speak up as we go, because I want to make sure everyone understands; this is meant to be educational.

A little bit of bureaucracy: these are some of the founding members of the IO Visor project. I work for PLUMgrid, which is one of the core contributors, and I spend most of my time contributing code. IO Visor actually goes back a long way. In some of my work at PLUMgrid, I've been building network applications for SDN, for cloud, for data centers, and that requires writing infrastructure applications. It's not the same as writing a web application or something that talks to databases a lot; it's a bit low-level. As we went along building some of these applications, we went looking for toolkits we could use to extend that functionality, and we came up short. So as we were building our product, we ended up building out that SDK, and now we're contributing back some of that toolkit.

For some of the things we want to build, you need to extend the Linux kernel, you need to add functionality, and that can be hard for someone who knows networking or storage or other higher-level concepts, maybe from different disciplines, but not kernel development. There's a barrier to entry in writing code for the Linux kernel: if you write a kernel module, you have to rebuild your kernel, and there are a lot of rules you have to follow to build into the kernel safely. How do you live in both worlds? That's the question we were asking ourselves, and we think there should be a better way.

So let me give an analogy; I'm going to compare a little to Node.js. I don't actually code much in JavaScript, and you can like or dislike that particular language, but it has some interesting points. Writing multi-threaded applications is hard; that's always been true, and it will probably be true for a long time. If developers have a problem they want to solve, and their toolkit is built around how the computer works rather than how the developer thinks, you're going to have some friction. Node.js has a different model than what's typical, and we've seen applications built on Node.js take off because the syntax, the event-driven framework, models the
thought process of the developer rather than the computer itself. At the same time, because there are smart people building that language and tooling, you don't have to sacrifice the things you'd want when writing a server application. The V8 engine inside Node.js is pretty good at translating JavaScript, a pretty high-level, fluid language, into machine code, and it does that on the fly, which is pretty impressive. Combine that with a repository of modules of code other people have written, which you can put together in interesting ways, and you get a nice velocity for the web applications you're trying to build; that ends up building a community and moving the ball forward.

So what would you need for infrastructure applications? Well, the environment is completely different. You're maybe moving data from A to B very quickly, or getting data from your database onto your disk, and you need that done in a certain way, so you have high performance requirements to meet. And if you're writing for the Linux kernel, you can't crash it: the servers we write for should ideally be up for years, and you want things to be reliable. If you're developing quickly, you can't reboot the system every time you have a new piece of code to try, so you'd like in-place upgrades. You'd also like debug tools, visibility into the infrastructure apps you're writing. And it's especially nice if you have a programming-language abstraction that matches the problem you're trying to solve. C is the language of the kernel, and that will probably always be the case, but C isn't necessarily the best language if you're writing for networking. Packets, for instance, have a very well-defined structure, and a lot of people have put thought into how a networking-specific language would look. We're working with some of the people developing the P4 language, which is a way to write switches, network device implementations, in a high-level language, and they have some smart people working on compiling that into other implementations; we actually have some collaboration there.

So we have infrastructure we want to build toward the kernel, but you don't want a custom kernel; you want this to be something that's upstreamed. Doing this with kernel modules works, but depending on the customer, maybe they don't trust whoever is shipping the kernel module. If you could do this without turning on painful flags in your kernel config, that would certainly be nice, and, as we said, you don't want to reboot your system. Not all of these are mandatory, but they're all nice to have, and for some of the problems we're trying to solve, the existing solutions may or may not satisfy these restrictions. So, as a result, looking at that problem statement, the IO Visor project is the engine and the tools and the community, all of these things put together, trying to enable people to write applications for infrastructure, for data centers, for moving data around in their systems. So let's show something in action, just to whet the appetite.
Here's one tiny application I've written with the toolkit we have; we call it BCC, the BPF Compiler Collection. This tool, which you can install or compile yourself, starts from a Python interface, and we're going to run this program, which attaches this little C snippet so that every time a new process is spawned on this laptop (yes, it's a live demo), it prints something. It hasn't happened yet; let me spawn a new terminal, and: hello world. Now I have something else I want to try, so let's change the program. That's our hello world for infrastructure applications.

There's infrastructure the kernel provides here called trace_printk: printk-style messages from the kernel that are written to a trace buffer. The syntax of these lines is documented: if I'm not mistaken, we have the process name, the CPU ID the particular printk came from, a series of flags that represent the context at the time of the print, your timestamp since the system was booted, and your message. This infrastructure is not something IO Visor provides; there are a whole bunch of other pieces of infrastructure that use the printk, kprobe, and tracing infrastructure in different ways, and we're leveraging that.
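For reference, the demo is essentially BCC's stock hello-world example. Here's a sketch of it, assuming a kernel with eBPF support and the bcc Python bindings installed; note the clone syscall's symbol name varies across kernel versions (newer kernels use prefixed names like `__x64_sys_clone`):

```python
#!/usr/bin/env python3
# Roughly BCC's classic hello_world: print a line whenever a new process
# is spawned (clone syscall). Requires root and the bcc package.
from bcc import BPF

prog = """
int kprobe__sys_clone(void *ctx) {
    // bpf_trace_printk writes into the kernel trace buffer described above
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""

b = BPF(text=prog)   # compiles the C via LLVM, then verifies and loads it
b.trace_print()      # streams the trace buffer (trace_pipe) to stdout
```

The `kprobe__` prefix is BCC's convention for auto-attaching the function as a kprobe on the named kernel function.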
Okay, so let's dive in a little. BPF is one of the building blocks of this tool, and we call these things BPF programs. In a visual way, you can see the different components here. There's the user-space component, which we saw: a Python application where you drive what's happening in the system. There's a system-call interface between that user-space component and what's in the kernel. Inside the kernel, there are different hook points where you can attach to a particular event, and you attach your function to those events using various pieces of kernel infrastructure. The functions you run are those C programs we saw; the hello world was a single-function example. And, as we haven't shown yet but will a little later, you also get access to a set of tables, hash tables or arrays, that the program can write data into and read data from, and that user space can write data into and read data from, so you have a clean API between user space and kernel space to move data back and forth programmatically.

BPF is a pretty old technology; it started back in the '90s, in other OSes first, and then came to Linux, primarily used for capturing packets: for looking inside the data of a packet and deciding whether you want to capture it. The problem with doing that in user space is that, while you could do whatever analysis you want on the packet, you have to copy the data, so it could be slow; hence this in-kernel infrastructure for looking into data. Starting in kernel 3.18, we started to upstream some extensions that I'll go into later, and we call the result a BPF program. A lot of the kernel documentation calls it a program, but it's not really a program; there's no process ID; it's really just an event handler. It's a small piece of code that gets executed; maybe think of it as a little scripting language inside the kernel. There's an interpreter that takes those instructions and runs them for you, there's an instruction set, and for more details there's a man page in the kernel sources, if you have a recent system.

Originally the instruction set had just two registers and no stack; you could load data from a packet and give a return code, pretty simple. Some of the extensions we added to really enable new functionality were to expand the number of registers, add a stack, add conditionals, and add the ability to call functions. That last one comes with a big caveat: you can't call just any function; there's a very restricted set of helper functions you can call. There's also access to data structures, maps, hash tables and arrays, where you can do lookups, updates, and deletes. Examples of the available helper functions include things for working with packets and things for reading kernel memory; we'll show some details of that later.

The places you can attach programs to are a limited set, with more coming in the future. Kprobes and uprobes are kernel functionality for setting what is effectively a breakpoint in the kernel, arbitrarily: if you know an interesting function, you can set a kprobe on it to print something; that's the original infrastructure. We extended that so you can run a BPF function whenever a given kernel function is executed, and it does that every time; that's one way you can attach programs. Socket filters are the original use case: with TAP or raw sockets, you can capture data. A recent one, which is pretty interesting and which I haven't used myself, is packet fanout: suppose you have a protocol with multiple streams, something like SPDY, HTTP/2, or QUIC, that needs per-packet load balancing to different sockets or listening applications, depending on what's coming over the connection. Seccomp is an interesting use case: with seccomp, for every system call that's made, you can run a program to determine whether the process is allowed to make that system call or not. That's a pretty powerful security feature; Chrome, I think, is one example user, running a BPF program to take a logical decision to lock down different tabs in your browser. And the ones I'm most interested in are TC filters and actions, on packets coming in or out of an interface. With these, every time a packet comes into a network device, you can choose to run a program: you can modify the packet, you can drop it or allow it based on your own criteria, or you can forward it to a different network device than the one it was originally destined for, implementing new behavior inside one of these filters.

So we have all this power, and the question I would ask is: why should you trust me, or trust these programs? And the answer is that you shouldn't; you should trust the Linux kernel.
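To make that concrete, here's roughly what the invalid-program demo coming up in a moment looks like, in the same BCC style as the earlier sketch (the exact rejection message varies by kernel and bcc version):

```python
from bcc import BPF

# A deliberately invalid program: the in-kernel verifier should refuse to
# load it, because it dereferences a NULL pointer (invalid memory access).
bad = """
int kprobe__sys_clone(void *ctx) {
    int *p = 0;
    return *p;   // NULL dereference: rejected at load time, not at runtime
}
"""

try:
    BPF(text=bad)    # compile succeeds; the kernel's verifier rejects the load
except Exception as e:
    print("kernel refused the program:", e)
```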
Then the kernel takes that set of instructions and verifies that it's a valid program. What does that mean? A valid program is something that will execute in the kernel safely. It shouldn't have any loops, because a loop could be infinite — I've injected code into the kernel, it's consuming an entire CPU, it's going to run forever. It shouldn't have any illegal instructions, so the kernel verifies that the syntax of the program is exactly right. And it verifies that any memory the program accesses, reading or writing, is only memory that belongs to the program — on its stack, or obtained from a valid helper function — so you can't dereference null from one of these programs.

This is an easy one, so let's take my hello world and just add a null dereference — that should be interesting, right? Nope. Here you can see the BPF instructions that the kernel received: it parsed them into its data structure and checked the memory accesses, and in fact there is an invalid memory access here, so it refused the program.

Once the program is verified, the kernel takes those instructions and — if you're on the right architecture — JIT-compiles them to native instructions, so the programs you load run at full speed on x86, arm64, or s390. Who here knows what s390 is? Okay, not bad — it's IBM's big honking machine; I actually saw one at one of the LinuxCons a couple of months back.

So, to revisit: these are BPF programs, and you can do lots of things with them. We were looking at just the C subset of the programs we load, but now we're building up the workflow: we have a toolchain that takes your program — which, as we said, you should be able to write in a high-level language — and runs it in the kernel. The workflow leverages LLVM to take a C program, or really anything you can parse with LLVM — we have examples where a custom language generates the LLVM intermediate representation — and you can also write C, since the clang front end has a BPF output option.

But there are some quirks about the programs you can run in the kernel. Take the trace_printk from our hello world: we passed it a string argument with our hello world in it. When you declare such a string it normally goes into the global section — but a BPF program doesn't have a global section; it doesn't really have any sections. It's not an ELF file; it's just a series of instructions you hand to the kernel. So you have to somehow trick the compiler into putting that string on the stack, and there are other such restrictions we can show. Taking that first one as an example, we want to convince clang that the string is on the stack, and to do that the BCC library we've written takes the C program through a workflow that produces valid BPF instructions. If you've never worked with clang or LLVM, there are some really cool things you can do, and I'd urge you to try them — for instance, there's an interactive C++ demo where you write C++ on the fly and it keeps generating new instructions in the same process. That's actually one of the features we use.
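To reproduce that rejection yourself, something along these lines should do it — a sketch; exactly what the verifier prints, and whether clang preserves the bad load, depends on your kernel and compiler versions:

```python
from bcc import BPF

# deliberately broken: loads through a null pointer, which the in-kernel
# verifier should refuse as an invalid memory access
bad = """
int kprobe__sys_clone(void *ctx) {
    int *p = 0;
    return *p;
}
"""
try:
    BPF(text=bad)
except Exception as e:
    print("kernel rejected the program:\n%s" % e)
```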
There's a JIT there that can take a valid program — the intermediate representation of the program — and convert it to native code. Here's a little more detail about the process: there's a rewriter that takes the C program we feed in and converts it into another C program. Both are C programs; it's just that the second one is something that will actually translate to valid BPF. After that, the programs go through the standard LLVM workflow, so you get optimized BPF instructions rather than a naive translation.

For example, take the trace_printk we had in our hello world: if you loaded it without the translation, it would be rejected by the kernel because it accesses an invalid pointer. The rewriter basically expands it as a macro — in this case; some of the others couldn't be done that way — and rewrites the program into this other syntax.

Here's another example. When you're running a BPF program, you're only allowed to access memory that's on your stack. But if you want to do something interesting in the kernel, you probably want things like pointer dereferences — and if your pointer points somewhere that's not on the stack, what do you do? In this example we'd attach the program as a kprobe to look at the previous PID when the kernel does a task switch — every time the kernel schedules from one process to another, do something interesting with the previous PID. You'd naturally write that as a dereference, but you can't actually do that: there's no dereference operation in BPF for arbitrary kernel memory. What you use instead is a helper function, bpf_probe_read, that reads kernel memory and validates it — it does a runtime check, every time, that the pointer you're asking for is valid — and fills in the data structure for you.

Another example: if you were writing a network BPF program, you might be interested in the source and dest IP address of each packet. To look at those very simply, you could write the arithmetic yourself to work out the offset from the start of the packet — but to simplify these programs, this gets translated to a helper function that reads the contents of the packet: give it a byte offset and it gives you back, say, a 32-bit result.

Going back, there's one more example: those hash tables and arrays we can use inside these programs have an implementation that's actually a little interesting. When a map is created, the kernel hands you a file descriptor that's unique to your process — it's not a named data structure; it has to be allocated for you. To use the map functions you actually have to know the file descriptor of that table and pass it to the helper function, and the rewriter assists you in that too.
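Putting the last two pieces together — the guarded kernel-memory read and a map — a sketch along the lines of bcc's task-switch example might look like this (assuming `finish_task_switch` as the scheduler hook, which is kernel-version-dependent):

```python
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

BPF_HASH(switches, u32, u64);   // previous pid -> times switched out

int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
    // prev->pid looks like a plain dereference, but prev points at
    // kernel memory: the BCC rewriter turns this into a guarded
    // bpf_probe_read() before the verifier ever sees it, and the map
    // call below into a lookup against the map's file descriptor
    u32 pid = prev->pid;
    switches.increment(pid);
    return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")
```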
Any questions on the rewriter? [Can you see the rewritten output?] You can, and we should be able to show that — there's a debug flag exposed through the API. Let's take out the null dereference first. It's a one-liner, so you can only see it here at the end: the trace_printk, and the translated output — there's a runtime flag to do that. With the more complicated examples this one isn't that interesting, but there are some interesting examples of the rewriter on the GitHub that I wanted to point you at — go look and experiment with those.

So I have one demo I want to show. Switching gears to networking: what's something you could build with this toolkit for networking? Suppose I had a set of machines running some cloud workload, with a VXLAN tunnel established between multiple machines doing an overlay — very typical in OpenStack environments. I might have some troubleshooting to do, and I have an idea for adding some metrics and analytics to that. I'll show you what I mean. Let's start a simulation where we have nine virtual machine hosts, or container hosts, or whatever, that establish a VXLAN tunnel between them, and within each of those hosts I'll start a whole bunch of clients talking to each other over that tunnel — so we have a demo up here with a whole bunch of traffic going back and forth. This tool took me maybe a week or two to write as a demo on top of the toolkit. What I can see here, for instance, is all of the hosts that are talking to each other and what bandwidth they're using: here I'm running this particular analysis on 172.16.1.100, and it's looking at all the other hosts this host is talking to; I can hover over the different endpoints and see the utilization. And that's fine — I could do that with tcpdump, or NetFlow, or various other tools.

[How is it implemented?] This is a chord diagram implemented in D3, so the front end is D3, and there's a back end in Python that collects the contents of the BPF hash table holding the statistics in the kernel and presents them as JSON whenever the browser requests it. I can also filter, for instance by VXLAN ID, so I can see the different tunnels going across these hosts; and I can dive down into a particular endpoint to see which inner IP addresses are being carried over it — filter by endpoint and see the packets inside the encapsulation. So I can see the statistics for this inner host, .3 talking to .1, and what contribution it's making. Now let's add a noisy client in there somewhere — some trouble on my network; some service wasn't reachable. I could add this to the tools in my tool bag so that while I'm paged and under the gun to figure out what's going wrong, I can look at this graph and say, with a simple glance: there's definitely a problem between these hosts — and in fact I can tell you exactly who is consuming that bandwidth. I can go track down that client with whatever other tools and stop them; give it a couple of seconds while the bandwidth evens out, and we're back to normal.
So how do we do that? We have our more complicated BPF program, which can parse IP packets and keep counters across multiple layers of encapsulation. There's various C code here — you don't have to memorize it while you're watching — but you can see that it parses the outer packet, keeps some statistics in a helper location, calls another helper function to parse the inner packet, then combines the inner and outer IP addresses into a key and increments statistics on the number of packets and the number of bytes associated with that key, that tuple. That's how we wrote that program.
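In the spirit of that program (the real one lives with the bcc networking examples on GitHub), a stripped-down version of the in-kernel half might look roughly like this — a sketch that leans on bcc's proto.h cursor helpers, skips IP options, and assumes IPv4 throughout, so treat field names and structure as illustrative:

```python
from bcc import BPF

prog = r"""
#include <bcc/proto.h>

struct key_t  { u32 outer_sip, outer_dip, inner_sip, inner_dip; };
struct leaf_t { u64 packets, bytes; };
BPF_HASH(stats, struct key_t, struct leaf_t, 10240);

int vxlan_stats(struct __sk_buff *skb) {
    u8 *cursor = 0;
    struct ethernet_t *eth = cursor_advance(cursor, sizeof(*eth));
    if (eth->type != 0x0800) return 0;            // outer IPv4 only
    struct ip_t *outer = cursor_advance(cursor, sizeof(*outer));
    if (outer->nextp != 17) return 0;             // outer must be UDP
    struct udp_t *udp = cursor_advance(cursor, sizeof(*udp));
    if (udp->dport != 4789) return 0;             // standard VXLAN port
    cursor_advance(cursor, sizeof(struct vxlan_t));  // skip VXLAN header
    struct ethernet_t *ieth = cursor_advance(cursor, sizeof(*ieth));
    if (ieth->type != 0x0800) return 0;           // inner IPv4 only
    struct ip_t *inner = cursor_advance(cursor, sizeof(*inner));

    struct key_t key = {outer->src, outer->dst, inner->src, inner->dst};
    struct leaf_t zero = {};
    struct leaf_t *leaf = stats.lookup_or_init(&key, &zero);
    if (leaf) { leaf->packets++; leaf->bytes += skb->len; }
    return 0;
}
"""
b = BPF(text=prog)
fn = b.load_func("vxlan_stats", BPF.SCHED_CLS)   # attach as a TC classifier
```

The user-space half is then just iterating `b["stats"].items()` on a timer and serializing the counters to JSON for the D3 front end.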
And there are lots of other things possible — this is just the start. Some things we're thinking about for the future, at least from the networking point of view: creating this idea of IOModules — fungible pieces of code, kernel space and user space, that you can download to add new functionality to your system at runtime without impacting production. I'm making the analogy again — remember the Node.js comparison from the introduction — an npm-repository-like thing of IOModules you can download and just run. We want to enrich the SDK: right now there's a Python interface that's direct access to the C API, and we're going to clean that up a little and add some nice IDEs on top, so you can combine multiple modules, connect them, and create maybe a simulated network, or real applications. We're also working on a UI so that when you connect those modules you can see what's really implemented on your server — how packets flow through the various components — and deploy them on the fly: download modules, create new topologies, create new use cases.

At this point I'd like to hand off to a colleague who's been helping out with the project, from Netflix. [Question: can you attach more than one program to the same event?] The way the Python API is implemented right now you can't do more than one, so you'd have to change the API a bit; the kernel side does support more than one, and they run in the order they were attached. Actually, one of the features of these programs is the ability to call other programs from a program, so I could conceive of a setup where, when you have more than one program on a kprobe, you customize the order — that's possible. [Question: how does this fit with unikernel development?] If those unikernels support BPF as one of the attachment points — it's a very lightweight and robust tool, usable in lots of different places, and it doesn't have a lot of requirements. For the C-to-BPF translation, the way we built the library it's a single .so, so you can build it off-box and take it along with you, because LLVM and clang are themselves built as libraries; you can package it in that form and it works just fine. I don't know if anyone has tried that, but I think it's possible. [Question: what if you load two hello worlds?] The second one would fail. There's a single place in the kernel filesystem where kprobes get attached, and the way it's implemented right now each attachment has a unique name depending on what you're attaching to — so if you try to attach two hello worlds, both to sys_clone, they both get the same name and only one of them can attach. Let's switch to the next talk, and he can answer that question.

So — do you want to use this one or that one? My name is Brendan; I'm also a Brendan, and if you join us to work on the BCC project this year we'll call you Brendan as well. BCC and eBPF can do lots and lots of things — so much that it can be difficult to really comprehend — so one thing I'd like to do is show a particular use case that BCC makes possible. I've been creating various tracing programs for BCC and publishing them as open source, and the one I'd like to start with began with a post I did on the 18th of January. This is new stuff: eBPF lets us write arbitrary programs that the kernel will run, and one thing the Linux kernel has not been able to do, for a long time or forever, is frequency-count stack traces. I do performance engineering at Netflix and I deal with stack traces all the time — profiling CPU usage, looking at blocked code paths. Using eBPF and BCC I was able to hack in the functionality for the kernel to collect stack traces — here it's on the submit_bio function — and then frequency-count them: while I was tracing, I hit this code path one time, I hit this code path 79 times — submitting block-device I/O via VFS, and so on. It's a fairly basic capability, it exists in other tracers, and it's really useful for kernel exploration: I'm trying to get my head around this function — what code paths lead to it? This works on stock standard Linux 4.3, and I did it by hacking BPF, because eBPF can do all sorts of crazy things. We're going to make that a first-class citizen with some proper API calls for stack traces.
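Stack traces did indeed become a first-class map type, so on a recent kernel a sketch of this kind of tool can be as small as the following — BPF_STACK_TRACE and get_stackid are real bcc interfaces; the rest is an illustrative skeleton:

```python
from bcc import BPF
from time import sleep

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, int, u64);
BPF_STACK_TRACE(stacks, 1024);

int on_submit_bio(struct pt_regs *ctx) {
    int id = stacks.get_stackid(ctx, 0);   // kernel stack id for this event
    if (id >= 0)
        counts.increment(id);
    return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event="submit_bio", fn_name="on_submit_bio")
sleep(10)

# print each unique stack once, with its frequency count
for k, v in sorted(b["counts"].items(), key=lambda kv: kv[1].value):
    print("%d times:" % v.value)
    for addr in b["stacks"].walk(k.value):
        print("  %s" % b.ksym(addr))
```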
Now that we can do this, we can start to build other tools. Who's used flame graphs before? Excellent — at least a dozen people. Flame graphs are a visualization for profiled stack traces. This is an example of a CPU flame graph, and I'm using it to understand why I have system time — time in the kernel. The x-axis has no meaning: it's not the passage of time; I sort the stack samples to maximize merging — it's actually alphabetical. The y-axis is stack depth. The colors here are random, so don't worry about colors on this one. The wider a frame is, the more often it was on CPU, so I can look at this flame graph and say: we spent 57% of our time in do_page_fault. I can investigate others — here's sys_open; why was sys_open on CPU? sys_open called do_sys_open, which called path_openat, and so on. The top edge shows what's on CPU, and everything beneath it is ancestry. Flame graphs have been around for a while and they're really useful — we're using them at Netflix to solve issues; we got them working with Java, so we can do mixed-mode flame graphs where you see the Java code and then go down into the kernel. That's fantastic — they solve on-CPU issues.

But there's a whole other category of off-CPU issues: when my application blocks because I'm waiting my turn on CPU, or I'm blocked on a lock — a synchronization lock, a condition variable — or on disk I/O, network I/O, or page faults, including swapping, where I'm waiting to be swapped back in. The way we traditionally attack all of those other issues is with a variety of tools: use iostat to look at disk, use tcpdump to look at the networking — and it's a lot of work. In the kernel, the scheduler deals with all of these off-CPU events: the scheduler switches the thread off. So for a long time we've wanted to attack all of those off-CPU issues with one approach, and that is: instrument the moment the kernel takes you off CPU and look at the stack trace, because the stack trace tells you why the kernel took you off — it will say I'm in a page fault, or I'm in block I/O, or I'm on a condition variable. That was not possible on Linux unless you went and got an add-on tracer, but now eBPF and BCC can use kprobes — kernel dynamic tracing — so I can trace the kernel scheduler functions and pull out stack traces. And now I can do kernel off-CPU time flame graphs, which is really exciting, and complementary to on-CPU flame graphs.
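The published tool for this is bcc's offcputime; the core idea, stripped down, and with the same caveat that the scheduler hook name (`finish_task_switch` here) varies by kernel version, is roughly:

```python
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

BPF_HASH(start, u32, u64);    // pid -> timestamp when switched off CPU
BPF_HASH(blocked, int, u64);  // kernel stack id -> total ns off CPU
BPF_STACK_TRACE(stacks, 1024);

int on_switch(struct pt_regs *ctx, struct task_struct *prev) {
    u64 ts = bpf_ktime_get_ns();

    // the previous task is going off CPU: stamp it
    u32 prev_pid = prev->pid;
    start.update(&prev_pid, &ts);

    // the current task is coming back on CPU: charge its blocked time
    // to the stack it blocked in
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid);
    if (tsp == 0)
        return 0;
    u64 delta = ts - *tsp;
    start.delete(&pid);
    int id = stacks.get_stackid(ctx, 0);
    if (id >= 0) {
        u64 zero = 0, *total = blocked.lookup_or_init(&id, &zero);
        if (total) (*total) += delta;
    }
    return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event="finish_task_switch", fn_name="on_switch")
```

Fold the per-stack totals into stack-collapsed output and flamegraph.pl does the rest, with widths proportional to blocked time.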
This shows the stack traces where we were blocked — I'm just showing the kernel paths. I was doing a Linux kernel build for this one, so I can now see make: make was blocked in sys_read, vfs_read, pipe_wait — it must be waiting for something else; I guess the make command had launched some sub-command and was waiting for its output, if it's in pipe_wait. And what's this guy? There's a little bit over here where make was also waiting in do_wait — probably waiting for a forked child process to exit, with the parent doing the wait. You can explore this and get a big clue. Here's sh, the Bourne shell; there's as. Let me bring something up — I think I ran a sleep command in here, doing nanosleep; it was over here. I ran the command-line sleep, sleep 3, for three seconds, and you can see sleep was blocked in sys_nanosleep for a total of three million microseconds — three seconds. I'm just sanity-testing my own visualization, making sure the numbers add up.

So this is great — off-CPU flame graphs. I can retire as a performance engineer: on-CPU flame graphs, off-CPU flame graphs, I can solve everything. Except it doesn't work that well. If you look at a lot of these code paths — sshd, for example; this is on my blog, so you can look at it later — sshd says: I'm blocked in sys_select, do_select, poll, and that's it, and then we're into the kernel scheduler. Very illustrative, but it doesn't tell me what I'm really blocked on, because I'm waiting on a file descriptor — so what am I actually waiting on? Another process. What happens is that another process, at the other end of that file descriptor, does its thing, then does a wakeup, and then I go back on CPU and read from the file descriptor. Now, the wakeup is a kernel function, so I can trace the wakeup as well and get information on it. Enter the next visualization, which I haven't blogged yet because I only did it in the last few days — this is brand new: I take an off-CPU flame graph and, on top of it, above the gray line, I paste the wakeup stacks. So you can see: I blocked on this code path, and this is the thread that did the wakeup for me — it gives you the next level of information.

Let me try to bring something up quickly — I was looking at sshd earlier, so I can use control-F to search for the sshd code path and click on it. Okay: we know sshd went into poll because it's waiting on a file descriptor, and now I can see — I've only done five stack frames here — it was woken up by the TTY receive-buffer code, which shouldn't be a big surprise, because sshd is waiting for piped I/O. Great — except there's kind of a problem: now we're looking at kworker/u16:1, and you discover the next problem, which is always the case with software engineering — you think you've solved it, but you haven't. That kworker is another thread, blocked on something else, so you need to go to the next level of wakeups. In fact I can explore that a little: that was kworker/u16:1, so I search for it — it's down here — and kworker/u16:1 blocked in ret_from_fork; it's waiting on these guys, waiting for these commands to finish, like the cc compiler and the Bourne shell. That's because I'm doing a Linux build over ssh: all these small commands are running, generating output, and it's being printed to the screen over the ssh session. By navigating through this wakeup flame graph I can see not only what I blocked on in the off-CPU stack, but its wakeup — and its wakeup's wakeup. Awesome.

All of this information is frequency-counted in the kernel: I was able to frequency-count not just the off-CPU stack trace but also the wakeup stack trace, associate the two, and count them together. I've not been able to do that before with any other tracer; it's possible with eBPF and BCC as they gain more capabilities, and I can code a lot of this in there. What I want to do eventually is turn this into a chain graph, where I keep pasting wakeup stacks so you can go all the way to metal — because metal wakes up everything — and then you see who woke up who woke up who, and then woke up me. I'm not the only person who thought this would be a good idea: I know some engineers in gaming have similar issues — they really care about performance and frame rates in computer games — and it's not just what you're blocked on; you have to walk all the wakeups to fully understand it.

So I'm really excited, because that's one thing it solves, but BCC and eBPF solve lots of other things. Just to give a couple of demos: there's opensnoop — on the IO Visor BCC GitHub repository there are a lot of scripts, and the way I've been sharing them is with a text file of example output for each script. Things like biolatency: the biolatency example does histograms of block I/O latency, aggregating in the kernel for efficiency, so it only prints the summary at user level. This starts to look similar to other tracers like DTrace and SystemTap — it is, and these are things we've done in other tracers before, but now eBPF is part of the kernel. I think I can run a couple of these at the command line, just to finish. [Question: for those analyses, how much overhead did you see collecting the statistics?] It's pretty low. eBPF is JIT-ed, and the way I'm writing these tools I'm doing as much as possible with in-kernel maps and aggregation, so I'm only printing summaries. I'm pretty impressed so far; we'll see how it holds up. We've started to use user-level stack traces as well, and that's a bit more CPU work, but for the off-CPU tracing — very rough numbers — I was going up to 100,000 events a second and seeing about a 3% CPU tax. That's not too bad, and I should mention I forgot to set the sysctl to turn on the BPF JIT (net.core.bpf_jit_enable), so it should actually be lower than 3%. The overhead is really impressive so far.
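For a feel of the in-kernel aggregation that biolatency relies on, here's a cut-down sketch — BPF_HISTOGRAM, bpf_log2l, and print_log2_hist are real bcc interfaces, but the block-layer functions to probe vary by kernel version, so treat the event names as placeholders:

```python
from bcc import BPF
from time import sleep

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HASH(start, struct request *);   // request -> issue timestamp
BPF_HISTOGRAM(dist);                 // log2 latency buckets

int trace_start(struct pt_regs *ctx, struct request *req) {
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}

int trace_done(struct pt_regs *ctx, struct request *req) {
    u64 *tsp = start.lookup(&req);
    if (tsp == 0)
        return 0;
    // only the bucket counters ever cross to user space
    dist.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
    start.delete(&req);
    return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event="blk_account_io_start", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_done", fn_name="trace_done")
sleep(10)
b["dist"].print_log2_hist("usecs")
```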
So, where are the tools? [Demo hiccup] That makes sense to me — I'll let Brendan look at the error; maybe it's your laptop. sudo — oh, that operation's not for me. I see, you're running it; I can't type sudo on Devara, I'm serious. Okay, great: really quick to run, really quick to start frequency-counting stack traces.

[Question: what about things like Graphite?] In reality this is awesome, but even at companies like Netflix there are probably only going to be a few of us who write BCC tools and use them directly — a lot of people will use them from a GUI. We're developing our own GUI, Vector, where we'll be able to pull in these BPF metrics, and the same will be true for lots of people: you have your own performance-monitoring product, your analysis product with a GUI. What eBPF means is that the kernel can now do all the cool things — heat maps, latency histograms — in-kernel; what you need to do, if you're the developer of that GUI, or its customer, is get the GUI to fetch that from the kernel. You can write eBPF programs directly in C — in the kernel source there are some under samples/bpf — but if you use the BCC Python front end it makes things a lot easier, and that's why a lot of the tools I'm now publishing are written using BCC: it's quick, and it's what Brenden's been developing. And if Python isn't your thing, there's an underlying C library you can use — they're working on Go bindings for the networking IOModule use case — so it's pretty versatile.

[Question about the x-axis] Yeah — the x-axis on the on-CPU flame graph is not time, it's population, and the left-to-right ordering is different from what a lot of people expect. What I'm doing is sampling stack traces at 99 hertz, throwing them all together, and re-sorting them to maximize merging so you can see the shape clearly; the left-right ordering doesn't matter. The off-CPU flame graph works a bit differently: there I'm tracing the scheduler events and measuring the time you're off CPU, and the widths are based on that time — again reshuffled to maximize merging. The widths mean how many samples: if I had a hundred samples in this function and one in another, it would be a hundred times as wide, so it is useful information — visually, the wider it is, the more it's on CPU. And the y-axis is stack depth. I think we're out of time, but keep asking questions if you have them — we'll take a few more. Can you repeat the question, by the way?

[Question: if you had a futex that was never woken up, can you figure out which process you were waiting on that never did the wakeup?] Sometimes there are things tracing can't do, and that includes: in the absence of an event, can I figure it out? If another thread never does the wakeup, I can't trace a wakeup that never happens. If you're premeditated, you can trace the initialization of the futex, so you'd have a log; that sort of thing is better suited to a kernel debugger, if you can interrogate the kernel — though a lot of the kernel debuggers are not real-time. That's an opportunity for you to join the development community and post it as a request; maybe it's possible — I don't think we've ruled it out — because you could keep a log of who grabbed things and when each was last grabbed. It's just not the first sort of use case I'd go to a tracer for, because tracers are usually based on events, and the absence of an event makes things hard.
But there may be a way to shed some light on it. As soon as you're talking about mutexes, memory allocation, scheduler events, we're in the territory of 100,000 events a second, a million events a second, and you really care about overhead. That's part of the reason eBPF is making more things possible: it lowers the overhead enough that we get to try things like the wakeup flame graphs, which I know would start to get prohibitive with other tracers.

[Question about other tracers] I've used every tracer there is. ktap was great: it got into staging, it was then asked to support eBPF so it could be a front end — just like the Python BCC is — and then development on ktap came to a standstill. I liked ktap, I thought it was fairly innovative, and I wish it had made it into the kernel, but it did need to integrate with eBPF as the back end, and ktap seems to have stopped. As for other tracers: SystemTap has done a lot of great work on the front end, and on support for USDT and user-level things; it doesn't appear it's ever going to be mainline, but if SystemTap were to use eBPF as a back end, just like BCC is doing, it could be a different story. And as Brenden said, one of the big values of eBPF and this work is the verifier. I'll write scripts in other tracers really quickly, and then they panic my system ten minutes later. Here, I write really quickly and it just says: no, I will not let you dereference that. It's more short-term pain, because I have to make sure it's right up front, but it saves me the long-term pain: by the time eBPF will run something, it will not panic the system. That's another difference, and it's why we'd like a lot of the other tracers to use eBPF as a back end — it's safer and has lower overhead. The way it gets compiled, it depends on kernel headers, but not the sources or debug symbols. I guess one more question and then we'll break — you can ask in the break. I guess it's time to break. Thank you.

Hello — can you hear me okay? Our next speaker is Mark Fasheh, on dedupe on Btrfs. Hello, my name is Mark Fasheh, I work at SUSE. I work on file systems: I'm the maintainer of the OCFS2 cluster file system, I work on Btrfs quite a bit, and I do general file system work for SUSE Labs. The talk today is on deduplication, and in order to talk about deduplication we should define it: quite literally, we're talking about removing duplicates of data — nothing super complicated. This happens across files, across the file system, which differs from, say, compression — think of making a zip file, or even file-based compression within the file system. A lot of people ask how this differs from compression: conceptually we're leaving the data alone — well, not really, because we're blowing away half of it and pointing everything back at one copy — but conceptually the stream is intact; we're not compressing it.

There are two forms of dedupe that we looked at, and generally there are two forms of dedupe, period. There's inline dedupe, which happens in the write path of the file system or the block device — mostly in this talk I'll discuss file systems, but there are block devices that dedupe at that level. If you're doing it inline, the write path has to calculate checksums and potentially maintain a table of duplicated hashes, so it's going to impact your write performance.
The tradeoff you get is that you dedupe right away: in theory, if you start with an empty device and dedupe as you write to it, you never write the same thing twice. So if you want to trade some write performance to always get the maximum dedupe, you do it inline. Out-of-band dedupe happens later: we let the data get written to disk, and at some later point the admin, or whoever is responsible for the data, decides what they want deduped. Because it's a deferred process, there's no impact on write performance; the downside, of course, is that temporarily you're using more space than you will later, because you'll dedupe it later. Does that make sense to everybody? Straightforward? Cool — and just raise your hand if you have a question; I guess that's the easiest way to do it.

So we had customers asking about deduplication at SUSE, and Btrfs is a natural fit for something like deduplication. The reason it's a natural fit is that it has to refcount its extents, so Btrfs already understands the concept of one extent being shared amongst multiple files. Usually you're creating a snapshot, or cloning an extent; this is essentially just the reverse of that process — instead of cloning an extent and putting it, say, in another subvolume, we're pointing to it from another subvolume. What I'm trying to get at is that at the end of the day it looks the same on disk: it's just another use of the extent pointers Btrfs already has. The other good reason is that we support Btrfs at SUSE, so obviously that's where I'm going to look first — and I can happily tell you there's a good number of engineers working on Btrfs at SUSE; I'm one of them, so we have very good coverage of bugs and features. That describes a bit more of my background, I guess: I started on Btrfs with that sort of bug fixing and adding features we needed, and then it turned into — okay, now that things are stable, what are the next steps we want to take? That's the advantage at my organization: we've got a lot of people working on it and we can do that sort of thing.

Alright — before I get into the nuts and bolts of how we dedupe, I want to describe how extents are laid out in Btrfs. The actual data is laid out pretty much as you'd expect of an extent on any extent-based file system: there's no header or anything; it's just on disk at some offset, with some length. Btrfs maintains what's called an extent tree, which is global to the file system, and it keeps track of all of those extents on disk — including things like refcounts on the extents. If an inode in Btrfs wants to point at some data, it has its own item, an extent data item, and that points directly to the extent. The reason we do this is that we can keep the extent tracking separate from, say, snapshots: when you create a snapshot, all you're doing is creating new extent data items that point at the actual extent on disk. Does everyone follow? I had limited time — I think a picture might have been better. And that gets us to the process of cloning an extent.
If you ask Btrfs: look, I have this extent in one file and I want it cloned into this other file — let's ignore what happens to the data that gets cloned over; it gets thrown away, and you're fine with that — then essentially what Btrfs does is rewrite that little extent data item so that it points at the extent you asked for. [Question: why keep the refcounts in the extent tree?] Sure, good question — you do that because you don't want the refcounting and all that sort of global metadata spread out amongst multiple snapshots, so you keep it in the extent tree. Does that answer the question? Another example of why you keep them separate is that the two places hold different information: in the extent tree you're just saying this extent lies at this offset on the disk, for this length; when you describe it for a file, you have additional information, like the offset within the file where the extent is located. As you can imagine, it would get really messy to have all of that in one place. Make better sense? Anyone else? Alright.

So now, with the background on Btrfs, I'll talk about how I implemented dedupe. We chose to do out-of-band: most of the people we talked to were not interested in sacrificing their write performance. There are actually patches for doing it in-band on the list today, which is pretty cool — but your choices with in-band are going to be using a lot of memory or maintaining a table on disk, and you'll be computing checksums while you're doing the write. We chose to forgo all of that: one, because a lot of customers were not happy about the idea of losing write performance; two, it's just a lot simpler — hacking the write path can get complicated, it's very performance-sensitive, and it seemed easier and better to do it via an ioctl, later. [Comment from the audience] Yeah — and that's where I'm going with out-of-band: the idea is that it gives you flexibility. The admin can know — hey, look, we're not busy between midnight and four a.m. —
— right, so that's when we'll run the dedupe process; or maybe you have some better intelligence monitoring the system that decides when to run it. The reason we keep it out of the kernel at that point is that there's no longer any reason to have it in the kernel: putting it there would just add a lot of complexity and potential bugs. And yes — that's a very good idea, and it's ultimately where I'm going with the project: for it to run as a daemon, deduping when you want it to.

So that covers why we chose out-of-band. The other thing I realized while doing the dedupe work is that no matter what, you can always have collisions. No matter what checksum you use, no matter how strong, there are always collisions, and it could be compromised. One thing I have learned is that people are very unforgiving if you corrupt their data — the one thing you learn in file systems is that you do not do that. So, because of that: don't trust the checksums. [Question: isn't the probability vanishingly small?] Yeah, there's still some mathematical probability — and what was I using initially? SHA-256, which is way overkill for my purposes; we'll get to that later. But my response would be: I could tell that to a customer, and they'd say, so there's still a chance it might corrupt my data, right? So you just don't do it. [Further discussion] I generally agree with you — and it does free you up to do some neat things, because you don't need really strong checksums anymore, which is kind of nice. By the way, checksums at the file system level are not there for detecting duplicate data — they're there for detecting errors. I bring that up not because of anything you said specifically, but because a lot of people come to me and ask: hey, why aren't you using the Btrfs checksums for this? That's one of the big reasons: they exist to detect errors. But I do have to take care of collisions — I can't do otherwise. Ultimately, what that means is that once we get into the kernel, we compare byte by byte: if you're asking to deduplicate this data, we make sure the pages are absolutely the same on either end before we do it. That's the promise we can give — we're checking.

So this is the interface I came up with, and it's pretty straightforward: you give a target file with an offset and a length, then you have a nice fat array with a bunch of fds in it, and you send up a request saying, hey, I'd like all of these files deduped, at these offsets and whatnot.
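That interface is the Btrfs extent-same ioctl, BTRFS_IOC_FILE_EXTENT_SAME (the ancestor of today's cross-filesystem FIDEDUPERANGE). Driving it from Python might look roughly like the sketch below; the struct layouts follow linux/btrfs.h, but treat the encoding here as illustrative rather than production-ready:

```python
import fcntl
import struct

def _IOWR(magic, nr, size):          # mirrors the kernel's _IOWR() macro
    return (3 << 30) | (size << 16) | (magic << 8) | nr

# struct btrfs_ioctl_same_args is 24 bytes before its flexible info[] array
BTRFS_IOC_FILE_EXTENT_SAME = _IOWR(0x94, 54, 24)

def dedupe(src_fd, offset, length, dest_fds):
    # __u64 logical_offset, __u64 length, __u16 dest_count, padding
    args = struct.pack("=QQHHI", offset, length, len(dest_fds), 0, 0)
    for fd in dest_fds:
        # per-destination: __s64 fd, __u64 logical_offset,
        # __u64 bytes_deduped, __s32 status, __u32 reserved
        args += struct.pack("=qQQiI", fd, offset, 0, 0, 0)
    buf = bytearray(args)                 # mutable, so the kernel writes back
    fcntl.ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, buf)
    # status: 0 = deduped, 1 = data differs, negative = -errno
    for i in range(len(dest_fds)):
        rec = bytes(buf[24 + i * 32: 24 + (i + 1) * 32])
        _, _, deduped, status, _ = struct.unpack("=qQQiI", rec)
        print("dest %d: status=%d bytes_deduped=%d" % (i, status, deduped))
```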
As I already described, the ioctl goes in and does a byte-by-byte compare — there's no hashing in the kernel; it's just a memcmp, basically. What that allows is doing the hashing in user space however you want: you want to store hashes in a file, store them in a file; you want to keep them in memory, keep them in memory; use this hash, use that one. Just on the question of which hash to use, I've had maybe four or five patches for different hashes, because everyone has their favorite hash or checksum algorithm. Like I said, it's pretty straightforward: it does everything under locks, and it returns whether you deduped or not, so user space understands — okay, did the compare fail, or was there some other reason you couldn't dedupe? There might be a permissions issue or whatnot. It's entirely possible for a file to change between when you call into the kernel and when the actual dedupe runs, so we have to handle all of those cases. Internally we reuse most of the clone code — clone is a Btrfs ioctl, and I touched on it before: it's the way Btrfs exposes to user space the ability to copy extents from one file to another. So all we're doing, basically, is the compare, and then cloning one extent over the other and letting it all fall out of the clone code: the clone code already knows to orphan the extent that gets overwritten and everything. That makes our life easy and shares bug fixes, which is nice. Any questions on that? I'll take that as a good sign.

[Question: how small can the chunks be?] For Btrfs, 4K — that's the minimum extent size. [And what do you use?] No — and that's actually a later slide — I do it at 128K. I'd be happy to explain why, if you want to know — I take it you do. When we're doing all the I/O and all the checksumming, my duperemove program — the user-space software — has the option: you can do 4K up to 1 meg. The I/O, the checksumming, everything just slows down a lot if you chunk it up into 4K pieces; additionally, you really start to fragment — I'd say your risk of fragmentation gets higher. So it's a balancing act between how quickly you want the I/O to happen and how much you'd like to avoid fragmenting the file system. I didn't have a great answer right off the bat, so I talked to my boss about it, talked to other engineers, and really I just experimented a lot: I did a lot of runs with different block sizes, and I looked at what other software did — 128K seemed to be a pretty common default block size. So from user space, the block size you compare at is basically up to the user, defaulting to 128K; internally, the file system can absolutely have extents from 4K up to — I want to say 256 megs would be the biggest extent. Does that answer your question? Okay, cool.

So if you want to check this out, it's on GitHub — that's the user-space part; the kernel part is in the mainline kernel.
And if you guys want, later I can show you the kernel code. [Question: doesn't that mean you read the data twice?] Yeah, that's absolutely correct: for your duplicated blocks, yes, the I/O happens twice. I won't lie — we don't magically do it only once. [What about the page cache?] The kernel uses the page cache, yes. So the worst case is that you do the I/O twice; the average case depends on your data set — if your set is way bigger than your memory, it'll probably happen twice; if it's smaller than your memory, it'll come from cache. Makes sense? Okay, cool.

Duperemove works by the standard file-based interface — I just want to get across that it's your standard Unix program; I've tried to make it as close to cat or cp or anything like that as I can. Right now, duperemove scans everything, builds extents from the scanned blocks, and then submits them. [Suggestion: submit as you go?] If you don't mind, I'm going to write that down, because that's actually a pretty good idea. The reason it doesn't submit right away is that I want to cram as many dedupes into one ioctl as I can, and I don't know ahead of time how many duplicates I'll find — but there's a possibility of doing that in the future. I have a feature — again, we're getting ahead, but that's okay — where duperemove can write the hashes to disk in a hash file, which is nice later when we rescan. And I have a feature coming up where I'll store transaction IDs in the hash file, so I can just query Btrfs for a given subvolume — hey, what was the last transaction ID on this subvolume? — and Btrfs will be able to tell me which files have changed. I'd say that's probably the last big feature I had in my head to complete the project. Okay — any more questions? Straightforward? Okay, cool.

Most of you have deduced this already, but duperemove basically works as a three-step process. First you have a file scan stage — and we went over why we use the 128K default. You can optionally use a temporary file for the hashes, and I recommend this: one of the lessons I learned is that if you don't, you will eat an enormous amount of memory. About 18 months ago I had to work something in, because I was finally hitting data sets where — oh wow, okay — this is more hashes than I have memory for. So now you can use a temporary file, and that expands the amount of space you can scan by a ton. We then take all of the scanned blocks and make extents out of them, and this goes back to wanting to reduce the number of ioctls I'm calling: I have an enormous table of duplicated blocks, and basically what I do is find all the dupes that form a multi-block chain and submit those together. Two reasons for that: one is the interest of calling into the kernel fewer times; secondly, I'm also just trying not to fragment.
I don't want to take a meg and chunk it out into 128K dedupes; I want to coalesce all those 128K blocks into a meg and dedupe that. So that's the intermediate step, and the last step is essentially an enormous loop of calls into the kernel: ask for a dedupe, get back my status, and so on. The hash file, by the way, has turned out to be really useful for testing: I have instructions on the wiki so that people who'd like to see how quick one of the stages is, or how their hardware handles it, can isolate the three-step process — write everything into a hash file first, and then dedupe from that hash file. My great feature, eventually, will tie that together, so you can run it with the hash file from your last run and it will just dynamically update it; hopefully by next year, when I do this talk, I'll have that. I've also had some people use it for things other than dedupe, because it can hash a lot of files very quickly — which has been pretty neat; just an interesting point. Any questions about the high-level view?

[Is it parallelized?] Oh yes — I thread the heck out of everything I can. The first and last steps — the file scan and the dedupe stages — are heavily threaded: as many CPUs as you have, or as many as you tell me I can use, that's how many threads I'll make. The middle, extent-finding step I've not been able to thread yet, and it's actually very CPU-intensive too, because at that point it's literally not allocating memory or going to disk — it's just walking this immense data structure and putting extents together. That's led to some discussions of maybe letting people optionally skip that step, if they're fine with just submitting a bunch of raw blocks to dedupe. But the first and last steps are heavily, heavily threaded — that's where I got an enormous amount of performance; picking a better hash helped a lot too. Any more questions about this part?
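For a concrete feel for the workflow being described, current duperemove versions document invocations along these lines (flags as of the project's README, so check `duperemove --help` on your build): a one-shot run is `duperemove -dr /home`, with `-r` to recurse and `-d` to actually submit the dedupes rather than just report them, while the two-phase, hash-file variant scans with `duperemove -r --hashfile=/var/tmp/home.db /home` and dedupes from the stored hashes in a later run against the same hash file.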
Again, what most people want to know is: okay, how fast does it work, and how much dedupe do I get? So here's my test: I copy my home directory — slash home — to a test machine and run duperemove against it. Right now that's about 750 gigabytes, and it takes about 45 minutes — 44 and change — so roughly an hour per terabyte, fully scanning it, categorizing it, and deduping it. Last year at this talk that took two hours, and the year before that it didn't finish. Part of this talk is just explaining the evolution of the project, and that was definitely one of the evolutions: the very first thing anyone ever asked was how quick it is and how much it can dedupe. It also gives you a decent idea of what you might see on a home directory: about 70 gigs out of 750 gigabytes, so roughly 10 percent-ish — and my home directory (obviously I'm not going to share the logs of it) is mostly photos, music, your usual stuff. I'm not going to share the hash file of my home directory either, but I'm desperately looking for a data set with the exact same pattern that is not my data — you know what I'm saying — that I could put up somewhere. I was actually surprised: 70 gigs is not bad for what I thought was pretty much non-deduplicatable data. There's no source code on it — I have a separate partition for my source code — and source code doesn't deduplicate that great either, I found out.

[Question about rescanning only what changed] Yeah, that's my big feature for the next revision, so version 11 should have that — that's where I'm at now; everything else is fast enough that it can be on the horizon. I'll do that using transaction IDs from Btrfs: if anyone has used the btrfs find-new command, it's essentially the same thing — you give it a subvolume and it tells you what has changed. It's pretty sweet, actually — really cool. [Question about running it as a service] I know someone on the Btrfs IRC channel who's doing something along those lines. Right now my idea is for it to be something you run, and then maybe it'd be run by another daemon — you'd run it and keep the log somewhere — but it's not a bad idea. I know at least one other person doing that with their own custom daemon; they have a really specific use case, and I haven't actually gotten the code from them — I think they're keeping it to themselves, which is fine. It's definitely a good idea; I'm not sure if I'll get to it or not. Cool, alright.

So, use cases. I say medium-sized because I don't want to over-promise: you have a good idea of the numbers now — a terabyte in an hour — so pointing it at a petabyte, maybe not yet. [Question: what happens if you run it twice?] Just once — running it again makes no difference, or sometimes you'll discover a little bit more. There was a good reason for that which I don't quite remember; I think it has to do with the extent search. But that's generally what you'd expect: you get it all out, and that's basically it. Virtual machine images were actually one of the first use cases this was presented with, and — I say "sometimes" — people just need to be aware that this fragments your virtual machine image. For a lot of people, if it's mostly read-only, that's great, they love it; but if you have a really busy disk, you probably don't want to do this. Makes sense? Cool. Last thing — I don't like angry users.

[Question about the performance impact of that fragmentation] I personally haven't seen an enormous impact yet — I'm waiting for it, but I haven't seen anything myself. I think it's the SSDs: taking out the seeks really helps. [Are you asking about reads, or about metadata?] Actually, I'm talking about both of those cases. It will absolutely — well, "blow up" is a strong term I don't want to use — but you will be rewriting the extent pointers, the extent data items I talked about, and presumably you could be splitting one up, so it might introduce a metadata overhead. Not for one file, no — but it's something to be aware of if you're running it across a lot of files. Does that make sense?
No, I don't have an actual number — that's true; it's more a matter of understanding that this is what happens. If it happens once, I agree with you, it's not really a big deal: leaves are about 16K in Btrfs, so there's plenty of room in a leaf for extra extent pointers and whatnot. But it's one of those things where, if you do it to one file, ten files, a thousand — things could blow up, basically. Makes sense? Any more questions?

[Question: how do you store the hashes?] I'm a kernel hacker, so we just copied the red-black tree code from the kernel, and I store them in an rbtree — in duperemove we keep an rbtree of the hashes, and I'm using a SQLite database on the back end to store them temporarily. [What's a red-black tree?] A balanced binary search tree, basically. There are two modes duperemove can run in: if you don't give it the file back end, it's going to use all the memory it can, and in that case you load up an enormous rbtree — I haven't had a problem with that so far. If you do give it the file back end, we still use a tree, but it stores an enormous amount less: we basically keep just the blocks we know we've found duplicates for, to make subsequent searches faster. Does that make sense?

[Question: are you passing physical extents to the kernel?] Right — so duperemove is not passing up "this is what I know is on disk as an extent," because you're right, it doesn't have that information naturally — it can discover it via FIEMAP, but it doesn't come naturally. You're absolutely right: we pass up logical extents, not physical extents. Duperemove works on logical extents, passes those into the kernel, and the kernel resolves that and figures out where it is on disk. [Question about what happens to the target extents] Yeah, absolutely — this is all handled within the clone code, so it's the same as cloning one portion of a file into another: you're doing the same exact amount of work. For example, as you pointed out, if there's already an extent where you're cloning into, you're going to blow it away — and you might wind up splitting it into two extents, if it's larger than the area you're cloning into. So you can wind up going from one extent to three extents for one portion of a file, because you'll have the two endpoints from the old extent and then your newly deduplicated third extent. That gets into the overhead we were talking about; it's the trade-off you make. Again, on a per-inode basis it's not really a big deal, but if you have a really busy virtual machine — maybe, I don't know. It's something to be aware of and understand.

[Question: how much memory does it use?] Well, now, with the SQLite back end, it's pretty bounded — oh, before that? Okay: you could basically do the math. I don't remember the exact overhead, but it was maybe a couple hundred bytes per node —
The question was about memory use before the SQLite back end existed. You could basically do the math: I don't remember the exact overhead, but say a couple hundred bytes per RB-tree node. Then take the terabyte you're scanning, divide it by the 128K block size, which is roughly eight million blocks, and multiply by the per-node overhead; at 200 bytes a node that's roughly 1.7 GB of memory. Something like that, exactly. And that's if you run it without the temporary back end. It turns out, running on this system with a nice SSD, I haven't found a really big performance difference from using the SQLite back end, which is pretty nice; I was happy with that.

Runtime? I'd say a few minutes, honestly, if I'm going from memory. I have a lot of these numbers on the wiki as well. Every time I do a release, this is the test I run just to make sure nothing broke: bring in a patch, set it off, and then I put up the numbers, so I can track how we've improved or regressed. Cool; any more questions about this part?

All right, I think we're getting near the end of the slides. This is essentially the bugs I found. Nothing's perfect; you find bugs in anything. First thing: kernel locking is complicated. When we're doing this, we're locking two inodes down, and that almost never happens in the kernel. The closest you get is a rename, where you can lock multiple inodes, but even there you don't lock down the data for the inodes. So this was an exercise in nesting an enormous number of locks, including the inode mutex, and it's probably not super surprising that we had one or two issues. Clone was locking in the opposite order, and that caused problems if you did a clone during a dedupe; that one I found just by reading the code. The one that actually showed up in the field: extent-same, the ioctl, was taking locks in the opposite order from readpage, and that would show up as rsync hanging on people's systems. I'd get a bug report, someone asking "why is rsync hanging on the system I'm deduping?", and I'd go look: huh, we're hung in readpage, I wonder why. There we go. That was probably the biggest bug I fixed that you might have encountered if you used it before.

Another interesting one: we drove rsync crazy. Clone changes mtime and ctime on the target inode, and rsync does not like that, because as far as it's concerned the file has changed. So the tool gets out there, people start giving me feedback, and one of the bits of feedback was: "This is great, it's deduping; why are my backups so slow now?" Okay, so maybe we shouldn't do that. This one was just a kernel patch: we skip that timestamp update for dedupe. Really easy, but important.

The other big one was that we weren't deduping the tails of files. If the tail of a file wasn't aligned, I was aligning the request down, and that turned out to be incredibly wasteful, because we would never remove those extents; you'd have this tiny tail that wasn't deduplicated on all these files. That was basically a bug-fix patch type of thing.
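That readpage/extent-same hang is a textbook ABBA deadlock: two code paths taking the same pair of locks in opposite orders. The standard fix, in the kernel or anywhere else, is to impose one global lock order. Here is a hedged userspace analogue of the pattern using pthreads (my illustration of the idea, not the actual Btrfs patch), ordering the pair by address before taking them:

```c
/* ABBA deadlock illustration: if one path locks (a, b) and another
 * locks (b, a), each can block forever holding the lock the other
 * wants. Fix: every caller agrees on a single global order. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Take two mutexes in a globally consistent order (lowest address
 * first), so all paths nest them the same way. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) { pthread_mutex_lock(a); return; }
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *t = a; a = b; b = t;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b)
        pthread_mutex_unlock(b);
}

static pthread_mutex_t src = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dst = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    /* Both the "extent-same" and "readpage" stand-ins call lock_pair(),
     * so the nesting order is identical no matter which file is first. */
    lock_pair(&src, &dst);
    puts(arg ? "path A holds both locks" : "path B holds both locks");
    unlock_pair(&src, &dst);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```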
Pretty straightforward. And all of this, if you have the latest kernels, is fixed there. All right, were there any questions about this so far? Okay.

Then the downsides, though I've touched on most of these already. The one I haven't: don't dedupe your backups. When you do that, you're putting everything in one place to fail; you have a single point of failure. Understand that disks fail, and that just one part of a disk might fail, which is another reason to be careful about what you dedupe: if it's something critical, it might be better that it stays duplicated on your disk.

By non-SSD I mean rotating hard drives; maybe not a nice term, I guess. On those you could get more seeking, because you're intentionally fragmenting the files to deduplicate them.

Okay, so I would say my only objection to deduping backups is that you're taking potentially critical data and making it a single point of failure. But you're right: if you have RAID or something underneath, it's different, so it's maybe a guideline, if that makes sense. And then you have more copies, too. Also, if you're replicating your backups, then you can dedupe them. It's definitely case by case; I just don't want the poor person who, like me, just backs up to USB storage to dedupe that, because I don't trust it to be super reliable.

Right, snapshots: in fact I use rsnapshot, so I'm kind of deduping my backups already, just at the file level rather than the extent level. It's not dedupe exactly, but it sounds funny to say that. Any other questions on that part so far? Yes, please.

On defragmentation: there's a defrag process in Btrfs, and I haven't had enough time to look into it. It's one of those things I need to look at, because it could be something I tell users: hey, run your dedupe, and afterwards run this defrag command to put everything nice again. But I'm not clear yet on exactly how that defrag works, because I haven't had the chance to study it. I do know they coexist fine, because I have them both turned on and nothing's blown up, and there's no reason why they shouldn't. I don't know enough about it yet, but my hope is that it could fix the fragmentation, and I have some other ideas for that too. Does that generally answer it? Cool.

So the last thing I found, and this was interesting, was bookend extents in Btrfs. Does anyone know what a bookended extent is? Okay, I didn't know either; when I heard the name I went, "Oh, that's what you call this? Bookending, okay, I get it." So: Btrfs is a copy-on-write file system. Say you've got a large extent, and someone writes into the middle of that extent. When Btrfs CoWs that extent (let me make sure I get this right), it might not copy the entire extent, because the only part that changed is in the middle. So what it'll do is
write the newly changed data in its own extent, and the old extent becomes what's called a bookend extent: only portions of it are used now, because the extent data items I referred to earlier are rewritten so that as you read the file logically you go, okay, jump into this extent, read this, jump to the new extent, and come back. The problem is that those extents cannot be split in Btrfs, so you lose that space for the entire time the extent is referenced. Now, this doesn't happen that often. One of the first things I learned doing file systems is that people almost never rewrite their files; 99.9% of the time a file is written once and that's it. But it's definitely something I found, so it'll be something I'll be looking at fixing, and it should be an "exciting" project, yeah. Again, it mostly doesn't turn out to be a problem, because most of the time the extents are laid out really nicely, and the duplicates are too. But it's an issue for more than just deduplication, because it's a space issue in general: you can have potentially hundreds of megabytes, maybe gigabytes, of data in an extent that's no longer referenced but is pinned in the file system. It doesn't happen too often, but it's important to understand. Any questions about this? Cool. How am I doing on time? Oh wow, okay, cool.

All right, planned features; I've talked about most of these already. Incremental dedupe based on the find-new command: for me, that's the feature that, once completed, will make me the happiest. We originally didn't support deduping within a file, and I don't remember exactly why that wasn't possible; it's possible now, so the next version will have support for it. Right now a file simply doesn't dedupe against itself, if that makes sense; it's possible, the code just needs fixing. And OCFS2 can also copy-on-write files, so OCFS2 will get a patch for this at some point.

On fiemap: yes, it's turned off by default right now, for reasons I'll get to. We fiemap the files in duperemove, before we scan them, so we understand the holes as they are; this matters because of the clone code, which won't find an extent there. The reason it's turned off actually has nothing to do with holes; it has a lot more to do with what Btrfs marks as shared. Btrfs marks an extent shared if it has a ref count greater than one, and I also try to optimize away shared extents: I say, well, if an extent is shared, it's probably deduped already, I don't need to do anything to it. It turns out, though, that a ref count greater than one doesn't tell you who it's shared with; it just says "shared." This portion of file A might be shared with file B but not with file C, and so on. Unfortunately those two things are rolled together in the code, so I just have to unroll them, and then I'll use fiemap to always skip the holes.
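Fiemap itself is a small, generic ioctl. Here is a hedged sketch (mine, not duperemove's scan code) that walks a file's extents and prints which ones the file system reports as shared, the same FIEMAP_EXTENT_SHARED signal the optimization above keys off; holes simply show up as gaps between the logical offsets:

```c
/* Sketch: walk a file's extent map with FS_IOC_FIEMAP and report
 * which extents the filesystem marks shared (ref count > 1). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 256

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct fiemap *fm = calloc(1, sizeof(*fm) +
                               MAX_EXTENTS * sizeof(struct fiemap_extent));
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush delalloc so the map is real */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("fiemap"); return 1; }

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
        struct fiemap_extent *e = &fm->fm_extents[i];
        printf("logical %llu phys %llu len %llu%s\n",
               (unsigned long long)e->fe_logical,
               (unsigned long long)e->fe_physical,
               (unsigned long long)e->fe_length,
               (e->fe_flags & FIEMAP_EXTENT_SHARED) ? "  [shared]" : "");
    }
    free(fm);
    return 0;
}
```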
Another thing: holes, sparse files, are luckily not that common, I found out, but you do have to support them. That's true, if you have a database that's a different story; but in the world of "I just untarred some stuff" or "I just added a file," most of the time you won't see them. Good question, though, thank you. Anything else? Any more questions? Cool.

All right, that's my GitHub page for duperemove; there's a wiki there, and there are packages for most distros on the Open Build Service site. Any questions in general, then?

The question was what I'd suggest to people who want to do more kernel work but are frustrated, and which tools are the best. Right, that's a good question. For getting into kernel work: patches, obviously; everything happens as patches on the mailing list. The Linux kernel mailing list has an enormous amount of traffic, and I don't actually know that it's the best place to go, to be perfectly honest. I wouldn't say ignore it, definitely check it out, but it is a firehose. I would say: find a project in the kernel that you're interested in. It doesn't have to be popular or cool, it just has to be interesting to you, and learn it really well. The nice thing about the kernel, and file systems are what I can speak from, is that it's really well designed: there's a VFS inside the kernel, and when I first got into file systems I was basically just filling in callbacks. But because it's all there, and because it's fairly well designed as a whole, once you've started in that small space it's easy to branch out. I learned more about the VFS just doing file system work; then, for example, OCFS2 initially didn't support clustered mmap, so I got to learn about the memory manager by implementing that. Does that make sense? Does that help? Yeah: pick something you like that's in the kernel, a device driver, a file system, and do small patches.

That particular area is definitely out of my wheelhouse, is that the right term? But it sounds like you have a place to start, so that would be something to check out. Kernel Newbies is good. For IRC, the network name is saved in my IRC client and escapes me right now, "linux net" is what it's called, I think; there's a #kernel channel and a #btrfs channel there, and I would check those out. And yes, a lot of people send small patches and get started that way, really simple one-liner stuff.

printk? It depends on the debugging. If it's something I can run myself, printk is the easiest. But honestly, if I have a customer machine and say they have a hang, you've got to start with the stacks: you send a sysrq-t to get a stack trace of all the processes on the system.
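If you're sitting on the hung machine's console, triggering that dump is a one-byte write; a small hedged example, assuming sysrq is enabled via /proc/sys/kernel/sysrq and you're root:

```c
/* Trigger sysrq-t programmatically: dump every task's stack to the
 * kernel log (read it back with dmesg). Needs root, and sysrq must
 * be enabled in /proc/sys/kernel/sysrq. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/sysrq-trigger", O_WRONLY);
    if (fd < 0) { perror("open /proc/sysrq-trigger"); return 1; }
    if (write(fd, "t", 1) != 1) { perror("write"); return 1; }
    close(fd);
    puts("task stacks dumped; see dmesg");
    return 0;
}
```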
Then you go look at that and work backwards from there. For example, the rsync bug I was referring to: the way I figured that one out was a stack trace from the kernel. I said, well, look at that: this guy is sitting in the acquire-lock subroutine, and this guy already holds a lock, because he's sitting in a different acquire-lock subroutine. And then your head goes off: okay, let me go look at Btrfs readpage, let me go look at extent-same; and lo and behold, we're taking the locks in the opposite order. So, things like that. GDB for the kernel exists, I should say; I haven't used it recently, though I make full use of gdb in user space, so why not.

I'd also say virtual machines are very useful because they're fast. I have them on SSD, maybe four of them, and I'll test kernel problems in those. They're great because it's very easy to iterate: they reboot very quickly, you don't take your own machine down, you don't take a test machine down, and you don't have to wait for physical hardware to boot. And presumably you could pause one, which would be kind of neat, but I haven't had to do that, so maybe not so useful. But those are the tools; you'd be surprised how useful printk can be. Any other questions?

The question was about Btrfs in general: the questioner had heard a few things that day, like the Facebook talk about pushing their kernel code upstream saying Btrfs is looking pretty good, it's almost there for them, they've pushed up a lot of bug fixes; and something similar in the GlusterFS talk. But they'd also heard, maybe from podcasts, about people out in the field hitting data loss; so where are the good use cases, and what level of sophistication does it take?

Okay, so: Btrfs moves very quickly and it's constantly getting better. It's actually quieted down quite a bit, but it was under very heavy development for a while. Part of it, I think, is that you're always just going to hear bad things; people have bad experiences, "it corrupted my data," and they don't forget it, so that's what comes up. The other thing I'd say is that at SUSE we curate the features. In fact, with Service Pack 1 we have a lot more enabled now; but, for example, there were some problems with compression when we released SLE 12, so we didn't turn compression on. We curated the features: we looked through them, we ran it ourselves, we asked what breaks a lot and what we don't understand, and that's how we got around a lot of those sore points
where people might enable a feature and then find out that while this portion of Btrfs is nice and stable and has had a lot of work, that part is brand new and it's trashing their data. I would say that curation is the biggest thing for our customers, and it's mostly what I can speak from. And if you're not a customer, you can presumably look anyway: it's not hidden, we don't hide it, all our code is up there, so you can easily find out what we turned off and why.

All righty, I'll take any more questions afterwards, I guess, because I really shouldn't keep anyone here any longer. So thank you all very much for your time, I appreciate it. Thank you.