All right, well, thank you all for coming. As some of you know, this talk is kind of my stock in trade, but I have not actually done it for over a year, so this is the first Kernel Report in a long time, and I would just like to say it's nice to be back. I can't think of a better place to get back into doing this.

As is my wont, I will start by talking about where we've come from, and then look a little bit at where we're going. So here's what we did over the course of the last year: the set of releases, and there were quite a few of them. In a sense there's not much to say about this, because it's looked like this for years now. We crank out a kernel every couple of months or so; the number of changesets goes up; the number of developers goes up. In fact, we're now up to over 1,400 developers contributing to every kernel release, so this continues to get bigger. We add about 200 new developers in every kernel development cycle: 200 people who have never contributed to the kernel before. So we have a lot of people coming in.

You can see a couple of trends here, and we'll get into the first of these in due course. As I pointed out, the number of changesets continues to go up. There was a period of time, way back around here, where Andrew Morton famously said that the number of changesets going into the kernel would have to go down over time, because we were going to have to actually finish this thing. Well, we're not there yet; in fact, the trend continues upwards.
I don't really see a whole lot changing with that; we just have more and more stuff to do. Again, as the slide points out, we've got more developers coming in and more code coming in, so you might think that at some point the process would bog down, because with more changes and more developers to coordinate, the whole thing should become more unwieldy. But this is the trend for the length of the kernel development cycle. A year ago I thought it wasn't really going to get any shorter, but it has in fact gotten a little bit shorter still. We're down to about 60 days from start to finish to merge a bunch of changes, stabilize them, and put out a stable kernel release. I don't think at this point that it's going to get a whole lot shorter than that, but you never know. If you extrapolate the line forward, it hits zero somewhere in the late 2020s, at which point very interesting things happen; that's the singularity. But no, I think we're going to hit a bottom limit at some point.

So as we put out these kernels, with more and more changes in less and less time, are they truly stable at this point? One thing you can look at is how many fixes we have to apply to them. These are the two stable trees currently run by Greg Kroah-Hartman; there are a bunch of other ones out there. You can see that for the 3.10 kernel, which has been under stable maintenance for a year and a half or so, we've had 61 updates containing nearly 4,000 fixes. And for 3.14, which is rather newer, we've had 2,300-some fixes. One might say that doesn't really look all that good; that's a lot of fixes. A lot of them are minor; some of them are not. But the real fact of the matter is that a lot of the problems found in our kernels don't come up until the kernel has actually been released.
There are a lot of things that the kernel developers simply cannot test, so we have to put the kernel out there. As Linus once famously said, that's really what we have users for: to test our kernels. And we really do honestly need them, because we cannot find these problems any other way; I'll come back to that. In summary, even though we're putting in lots of fixes, the truth of the matter seems to be that people are not all that unhappy with what we're putting out at this point. The kernels that come out work pretty well when they're released, they stabilize well, and they make a solid foundation for everything that everybody else builds on top of them.

So I took a quick look at some of the things we added over the course of the last year; I could spend all day just talking about this. We added seven new system calls: seven new areas of functionality that we didn't have before, some of them doing some pretty interesting things, and I'll come back to a few of these. On the feature front we added lots and lots of things, and I've just separated out a few of them.

We now finally have a deadline scheduler. This is of interest to the realtime folks primarily, but it's useful for streaming media and other things with deadlines. With deadline scheduling you do away with the whole notion of process priorities that has driven scheduling for so many years; instead, for each process you specify the amount of CPU time it needs and a deadline by which it must have it. If you're careful with your admission policies and such, you can actually make guarantees that processes will meet their deadlines. We're pretty much the first general-purpose operating system to have this and to be able to put it into production use, and it's going to be interesting to see what people do with it.

The control group subsystem is used for grouping processes together; it's used for containment and other sorts of things. It went in some years ago and, by general agreement, we did a fairly poor job of it. We're trying to do some new things, and the design that looked good at the time does not look so good now. So the whole control group subsystem has been reworked around a different design that we hope will be much more maintainable and much more useful going into the future. But of course we're not allowed to break things, so we still have to maintain compatibility for all the old users more or less indefinitely. It's kind of a problem, and it's been a major bit of work to balance those two needs: to make it maintainable and at the same time not break people. That has pretty much been done over the course of the last year.

The multiqueue block layer spreads our disk I/O processing across multiple nodes, tries to keep traffic node-local, and in general scales our disk I/O up to the needs of current systems. There's been a lot of talk about how the block layer and the networking layer sort of steal ideas from each other; this one came from networking first, and there are other things that have gone the other way, as was discussed here the other day. So it's kind of fun to see how what are, in a sense, the same problems, sending data back and forth quickly to some sort of remote node over a different interconnect, converge on a lot of the same solutions over time.

DRM render nodes: if you want to make use of your GPU but you're not actually doing graphics with it, you can use this facility to gain access to those resources without actually having to use a display. And we had lots of networking improvements, which again we heard about the other day.
We used to have real challenges, especially with lots of very small packets, trying to run at wire speed; and there are workloads that want to do that, high-frequency trading and other things like that. We've gotten a whole lot better, thanks to that guy right there and some other folks who've worked very hard on improving some of the core problems in the networking stack, bringing its performance up to where we needed it at least a few years ago; and we're getting it to where we need it to be next year as well. And there's a whole lot of other stuff I just didn't have room for on the slide.

Of course, we've added literally hundreds of new drivers. We typically add, I don't know, 60, 70, 80 new drivers in every kernel development cycle, depending on how you count, so we have no trouble supporting your hardware. We've added thousands of fixes and a whole ton of other things that I really couldn't list here. Again, I could spend all day just talking about what has been done in the 60-70,000 changes that have gone into the kernel over the course of the last year.

I'll just summarize by saying that the process continues to run pretty smoothly. We don't seem to be running into scalability problems with it at the moment, which surprises me; over the history of kernel development we have, of course, hit scalability problems with the development process at times, and at some point we'll probably run into one again, but for now it seems to be running pretty well. That said, I'm a nervous sort of guy, so there are some things that I worry about. I'm going to spend a little bit of time talking about those, and then we can get into some of the cooler stuff that's going on as well.
So I mentioned testing, and that we need more testing of the kernel, because that's the only way we find problems. The testing situation has gotten better in some ways. We've had linux-next for the better part of ten years now, and as a result we don't have a lot of the integration problems that we used to have during the merge window, when all of these changes come together and flow into the mainline kernel.

For the last couple of years we've had the 0-day build bot run by Fengguang Wu. If you put a tree out there with kernel changes in it, he will find it. He will track you down and find your changes whether you ask him to or not, pull them in, build-test them, run some basic runtime tests and so on, and then you get this friendly little email back saying: hey, you just broke things. So then you can fix it, and all these fixes go in before the code reaches the mainline kernel. I did a quick grep, and there are something like a thousand fixes in the kernel now that are credited to the 0-day build bot, and probably quite a few more that should have been, so it has really helped to solidify our kernel releases.

Then there are a ton of tools out there: static analysis tools, and things like the Trinity fuzz tester, which has been compared to a sort of denial-of-service attack against kernel developers, because as soon as Dave Jones points it at your subsystem, you have to stop whatever you're doing and fix all the bugs he's found, which are often quite a few. So again, we've got a lot of tools out there, and they're helping us to find problems, hopefully before they bite users, or before the bad guys find them.

So we're getting better in a number of ways, and this is good; but we're getting worse in a number of ways too. I've heard complaints recently that if you look back, say, ten years, we had a lot of distributors and other companies working in this area, running extensive sets of tests on the kernels to find the problems and stabilize things. Then they all realized that everybody else was running all these tests, so they didn't have to, and the amount of testing going on in companies has dropped quite a bit. That's something we want to try to reverse.

There's also a long-standing problem in that there are not a whole lot of self-tests built into the kernel itself. To an extent we've defended this by saying that so many of our problems are hardware-specific, and you can't really test for those unless you have the hardware to test on, so we can't put in an automated test for that; and a whole lot of problems are workload-specific, and we don't have the workloads either. But the truth of the matter is that there is a whole lot of testing that we could do, because something that developers currently can't do is ask a simple question: my changes pass my tests, but did I break the kernel somewhere else? People still run into situations where what they did works for them, but somebody else brings up the kernel, runs it, and finds something fairly basic that broke and should have been caught. So if we had a basic set of really simple sanity tests that kernel developers could run, just to verify that they didn't break any core functionality, that would help a lot. The nice thing is that we do have one now.
It's pretty rudimentary, but there is a "make kselftest" target that has gone into the kernel tree, and it's going to be developed over time; it should get better and help us to answer these questions.

But we need more wide-scale testing, and one area in particular that has been problematic for us is performance. Everybody wants the kernel to be fast and high-performance and all that sort of thing, but we have a simple problem here: the kernel is big, and it is used in a lot of situations. If you are, say, working on power-efficient scheduling to make your cell phone battery last longer, you may not realize that the nice fix you put in just regressed somebody's database-intensive NUMA workload on a huge machine by a couple of percent. You have no way to test that; it's outside of your world view entirely. So the change goes into the kernel, we've regressed somebody else's workload by a bit, and they don't notice until three years from now, when they've installed somebody's enterprise kernel and suddenly things go slower. What happens is that a bunch of these things go in, so that enterprise kernel has accumulated maybe a couple dozen half-percent regressions, and suddenly things are way slower. This is why people who run their systems right on the edge, big data centers and that sort of thing, are very, very leery about upgrading kernels: they tend to run into performance regressions that the people who caused them really cannot test for. We need to get more people doing performance testing of development kernels with their own workloads, so that we can find these things and not have to track them down years after the fact, when that is a very hard thing to do. So, again, this really is why we have users: kernel testing is something we need to get everybody doing more of if we want a more solid kernel.

A different topic is the realtime patch set, which one might say is indeed delayed at this point. It's been out of tree for many years, although much of that code has gone into the mainline kernel. The interesting thing is that the realtime developers have proved an interesting point: they proved that obtaining realtime response from a general-purpose kernel is in fact possible. This is something people were saying you simply could not do, especially the people who were selling other sorts of realtime solutions. But it has been demonstrated that you can. You can really only do it, though, if you can get support for the work, and in this case the support for the realtime work has pretty much vanished. Even the companies that are shipping this as part of their products are not supporting it. So the realtime work has come to a halt, the work on integrating this stuff into the mainline kernel has come to a halt, and in the worst case it could just sit there and bit-rot if nothing happens. This is a problem we tend to have in general, especially with core kernel functionality.
It's not really that hard to figure out who should be supporting the work to make a particular graphics chip, or your particular piece of hardware, work; hardware manufacturers generally understand that it is in their interest to make Linux work well with their hardware. But with the more core-oriented functionality it's really easy to say: well, that's not my problem, other people will work on this and it will all come about. And then you end up in situations where nobody is supporting the work and it kind of fades away. It's something that we have to watch out for, and it does happen, because we can't just order people to work on things in this community; we have to convince them that it's in their interest to do so.

A similar sort of issue is that of security. If you want to look for bad news in the security area, well, there's plenty of it. We had a lot of high-profile security incidents in the last year. Not many of them were kernel incidents, but think about vulnerabilities like Shellshock and Heartbleed, high-profile compromises of large companies with subsequent disclosure of information (and who knows what else went on), and the whole set of problems with the surveillance state that Eben was talking about yesterday. We have a lot of security concerns going on right now. I did a quick search in the CVE database of known vulnerabilities and found 115 of them reported for the kernel in 2014. That number is certainly an underestimate, because if you've ever looked in the CVE database, there's a whole bunch of entries that just say "this number is reserved, we'll tell you about it someday". So there are going to be more of them than that; but even 115.
That's a lot of vulnerabilities: one every three days showing up in the kernel. A lot of them are obscure and weird and hard to exploit, but some of them are not. It's not really the way we want things to go.

There's a lot of old and unmaintained code in the kernel, and if there's one thing we've learned in the last year, it's that really old and unmaintained code can be a source of bugs. The Shellshock bugs were something like 20 years old. There were certain bugs in the X.org server, disclosed a month or so ago, that were also around 20 years old. This kind of stuff can stay there for a long time. I ran a little tool I have over the 17 million lines or so of code in the kernel, and something over three million of those lines have not been touched since the kernel was first checked into git in 2005, ten years ago. So we have three million lines of code that nobody has touched and nobody has really looked at; there are a lot of dark corners in the kernel, and there are certainly going to be unpleasant surprises in some of those corners. We have a lot of motivated attackers out there looking for them, coming from criminal elements, from governments, from wherever else, and not a whole lot of people working on the problem of actually making the kernel more secure. We have people doing firefighting when problems come up, but we don't have a lot of people just asking: what can we do to make the kernel more secure? It has actually gotten a little better over time, but it's still an issue. If you want to look for good news: well, we had 175 CVEs in 2013, so one could try to extrapolate the trend downwards, but I wouldn't read a whole lot into that.
It's too many either way. We do have some people working on kernel hardening, trying to address the effects of a compromise with containment solutions and that sort of thing, but it's really not enough. If we are going to face the challenges that, again, Eben was talking about yesterday, of really living up to what free software can do, of providing systems that are secure and that address the needs of their users rather than the needs of some other element out there that wants to know what you're doing and control it, then we have to do a better job, and we have to do a better job at all levels of the stack. Not just the kernel; but certainly the kernel has to do a better job if we're going to meet those challenges.

Other challenges: the year 2038 is only 23 years away now. That, of course, is the year when the 32-bit time_t value will overflow, after which current times can no longer be represented in that format. You can think of it as a form of the year-2000 bug; it's a very similar sort of thing. And one might say: well, okay, that's fine, that's 23 years from now, I'll be retired; in the worst case I'll get to come out of retirement and make some money helping to fix it. But the simple fact of the matter is that there are systems being deployed now that will still be active in 2038. We're putting computers in cars, putting computers in building control systems; they're being embedded in all kinds of very deep places where some of them will still be operating then, and they'll be hard to find and impossible to update. So we are creating year-2038 problems now, and we need to do something about that.

The good news is that some people actually have this problem on their radar and are working on it. The core timekeeping code in the kernel was fixed in 2014, so at the very deepest levels of the kernel everything is now ready: there are no 32-bit signed time values in use in there. But as you work outwards, still within the kernel, there are issues. We have a lot of system calls that have 32-bit time values in them. You can't just change those, because there are applications that use them; you have to design new ones, and you have to figure out how to do that. And some of these are hidden in very strange places. There are the obvious timekeeping system calls, but there are also device drivers with ioctls that have time values in them, and some of those are going to be really hard to find and fix, especially, again, to fix in a way that does not break existing applications.
That's going to be really hard to do. You of course can't address a problem like that without involving the developers of the C libraries, because they, in the end, are going to be charged with providing the interfaces that work in the future. There have been conversations, and there is thought being put into it, but I'm not sure how much action is happening there. And at the application level, well, let's not even get started on that; there are a lot of problems there. This is going to be interesting. I think we'll deal with it, but if you have an application, or software at any level, that deals with time values, this issue should be on your radar now, because we have to fix it now. We can't wait 20 years and then go into a panic; by then it would be far, far too late.

Moving on to the area of deeply embedded systems: I have to have a properly buzzword-compliant talk, so the Internet of Things just has to fit in. We are deploying computers into all kinds of possibly ill-advised places where you don't necessarily want them, but they're going to show up anyway; people are doing all kinds of fairly wild things with their computers. The interesting thing is that if you're going to embed a computer inside a light bulb, or any of the other places where people want them, these systems are very small. And when I say small: a lot of people will say, okay, this is a small system, but the people working on Internet of Things applications think of a cell phone as something like a mainframe compared to what they're doing; it is not what they consider to be an embedded system. They're talking about deploying systems with, say, two megabytes of installed memory, which of course is what the first VAX I administered had, but we'll not get into that. At this point, that is a small system.

If you look at this plot, which was put up at the Kernel Summit some months ago, it shows the minimum size of a kernel you can build if you cut out everything you can carve out of it and make it as small as it can possibly be, plotted against the kernel release starting at 3.0. I didn't have to put a trend line on it; it's fairly obvious what's happening with the size of the kernel. At this point we're up to about one megabyte of memory for the absolute minimum kernel, which probably doesn't actually do what you need it to do, and that takes a pretty big chunk out of your two megabytes of memory before you even get to the applications you actually want to run on the system. This is a problem for people working in this area, and it's such a problem that we have people doing things like deploying 2.4 kernels in tiny embedded systems, because those are the only kernels that are small enough. That's just not something we want to see; 2.4 is something we should really have left behind a while ago. So it's a problem now, and it's going to continue to be a problem, because the growth of the kernel is not going to stop. It's not that people sit down and say, you know what, I'm going to bloat the kernel today because I like it fat; people are adding features that we want and need and use, and they're going to continue to do so. The kernel is going to continue to get bigger, but somehow we have to meet the needs of the Internet of Things people as well.

So there is a project out there, which I'm sure would welcome help from anybody who's interested, called the kernel tinification effort. They're working on ways to build a kernel that leaves out everything you don't need. You can think about leaving out device drivers or whatever, and that's fine; we can do that now. These people are looking at things like: my application doesn't use signals, so let me carve signal handling out of the kernel. They're doing surgery on the kernel at that level, taking out absolutely everything they don't need. This is some pretty wild stuff; in a way, you end up with kernels that don't look a whole lot like Unix anymore, because they lack really basic functionality that you expect to find on a Unix system. But if you're running one tiny little application in your light bulb, you don't need all that stuff, so you want to take it out.

There are a lot of challenges with this, beyond the simple matter of convincing the wider development community that it's worthwhile. If you add a configuration option for everything you might want to carve out of the kernel, you're going to add thousands of them, and that creates an incredible combinatorial explosion of combinations to test and maintain; you end up with an unmaintainable mess. So that's a non-starter; it's not going to go that way, and they're going to have to come up with some other way to do it. Then there's the question of what you do when somebody has taken signals and everything else out of their kernel and then complains that it doesn't work the way they want it to. The answer pretty much has to be: well, if you've broken the kernel in that way, you get to keep all the pieces; we really can't support that sort of thing. But people are going to come and complain anyway, because that's the way things are. And of course there's the continual problem of just keeping ahead of the ongoing growth of the kernel, because, as you've seen, we're adding 12-13,000 changes in every development cycle; we're adding a lot of stuff, and that's not really going to slow down. So it's an interesting challenge.
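To give a flavor of where this work starts: the "make tinyconfig" target that went into 3.18 produces the smallest possible starting configuration, and the tinification people then hand-tune from there. The fragment below is purely illustrative, not a complete or bootable configuration; the exact option set varies by kernel version.

```text
# Illustrative Kconfig fragment only; real tinified configurations are
# produced with "make tinyconfig" plus hand-tuning.
CONFIG_CC_OPTIMIZE_FOR_SIZE=y     # compile with -Os instead of -O2
CONFIG_EMBEDDED=y                 # expose the "expert" trimming options
CONFIG_SLOB=y                     # simplest memory allocator
# CONFIG_PRINTK is not set        # no kernel messages at all
# CONFIG_BUG is not set           # no BUG()/WARN() machinery
# CONFIG_BLOCK is not set         # no block layer
# CONFIG_NET is not set           # no networking stack
```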
I don't know how these people are going to do it; there are not that many people working on it, but they're trying. In the end, what I have to say is that either Linux is going to be suitable for this kind of application, or something else is going to come along and take this niche instead. I don't think we can just rely on the continued dominance of Linux in an area like this if we don't work to meet the needs of the people working in it. So I have this fear that somebody is going to come along with a tiny little Internet of Things operating system, and they'll put it out there, and we'll look at it and say: oh, look at it, it's so cute. But you know, it doesn't do anything; it can't possibly keep up with our thousands of developers and our incredible rate of change and our hardware support; heck, it's not even housebroken. But then, these are all the sorts of things that people said about Linux once upon a time. And so you wake up one day, and this cute little thing has grown up, eaten your dinner off your table, and taken over a whole share of the market that Linux perhaps chose not to address. I think it's really something we need to be concerned about if we want to continue pursuing world domination, and if we want to avoid the possibility that this cute little thing that comes along is rather less free and less controllable than what we have now, which I think is also a real threat.

So, enough worrying; let's talk about some of the new and interesting stuff that we've worked on over the course of the last year or so, starting with some things that are already in the kernel. Let's talk about sealed files and memfds, which is functionality we didn't have before. So what's a sealed file?
Well, even in this part of the world, it's not that kind of seal. Instead, a sealed file is a file that has been mapped into a region of memory, had some data put into it, and then been rendered immutable, so it can no longer be changed by anybody; the contents of that memory are essentially read-only at that point. And if you don't want to deal with the file at all, a memfd just takes the file away, actually hides it so you don't see it; it's simply a shareable memory area that you can populate and then seal so that it cannot be changed thereafter. This code went in for the 3.17 kernel.

You might ask why people want this sort of functionality, and the driving area here is distributed systems, inter-process communication, that kind of stuff. If you want to distribute your system across a messaging bus, and you want to send a message in a region of memory saying "please do this thing for me", perhaps a thing that requires privilege, it's really hard on the receiving side if the contents of that memory can change while they're working on it. That leads to bugs, and it can lead to security problems and so on. But if you can seal that memory, and the recipient can verify that the memory is in fact sealed and will not change, then they can trust what's in there not to mutate underneath them, and it becomes a whole lot easier to build a stable and secure system. So that stuff went in. I don't think a whole lot of people are using it yet, but the intended user is a thing called... all right, well, there's always an outlier in every crowd, and it's usually the same one. The intended user is a thing called kdbus, which is a kernel implementation of the D-Bus inter-process communication system. So people ask: why would you want this in the kernel?
Especially given that we try to keep things out of the kernel that are best done in user space, and D-Bus has worked well in user space for quite a few years now. There are a few answers to this; there's actually a big long page of answers if you really want to get into it, and some of the developers feel a little defensive about it. But it starts with performance: if you're sending messages through a daemon process somewhere, then every message requires a context switch over to that daemon and a context switch back to the recipient. If you run it through the kernel, you cut out a couple of those context switches, you cut out the accompanying copying of the data, and you make things go faster. Then there's security: the kernel can actually verify the credentials of the sender and the recipient, and provide guarantees to the recipient that a message actually came from where it claims to come from, that sort of thing. And early availability: if you're using a user-space messaging daemon, then you can't send messages until your system has booted to the point where that messaging daemon is actually running. With a kernel-based system you can use it right away, and do your early-boot logging and the other things you'd want to do really early in the lifetime of the system.

So this code has been posted for review a couple of times. The first time around it was pretty widely panned: there were some security issues, and there were a lot of complaints about how it handled containers, that sort of thing. The nice thing was that the complaints were very technical, not of the "oh, this is more systemd people doing stuff, so we're just going to scream about it" variety; it was actually a very technical discussion. As a result, the developers responded well and answered the complaints, and the second time around the reception on the list was much more positive.
They made some fairly significant changes to it. I'm known for being an optimist, but I think this stuff will probably get in after another round or two, and we'll see kdbus in the kernel in 2015; and then we will no longer be the only major operating system without a messaging system built into its kernel.

Virtual machines: here I'm not talking about virtualization, I'm talking about some sort of abstract machine that you put into the kernel so that you can write programs and run them within the kernel itself. You might say, well, why would I ever want to do a thing like that? It sounds insecure; it sounds like something you should be doing in user space. And people do say that, but we actually have a few of these already. If your system runs ACPI, there's that nice little AML interpreter in there, which is not only interpreting code in a privileged mode in the kernel, but interpreting code that was written by BIOS authors, so you should be very afraid. Netfilter, and especially nftables, are both based on virtual machines for filtering packets coming into the system. The tracepoint subsystem has a filtering mechanism which, again, has a virtual machine of sorts to allow the placement of conditions on when tracepoints should fire.

That one actually shows the limitations of very special-purpose virtual machines. There's an operation there where you can test whether a particular bit is set in a variable accessible to a tracepoint, and decide on that basis whether the tracepoint should fire or not; but there was no way to test whether that bit was clear. So of course somebody needed that, and they had to actually change the kernel to add that functionality to the tracepoint subsystem. If you had a more general-purpose virtual machine, you wouldn't have to do that sort of thing.

And the last one I want to talk about is BPF, which is used with socket filters now. It's the Berkeley packet filter, which was originally written
many years ago, before Linux existed, I believe, for use with tools like tcpdump; again, for filtering network packets quickly in a general-purpose sort of way, so that you don't have to try to envision every way somebody might want to filter packets in your filtering tool. It's used within the kernel now; it's been there for some years to allow the filtering of packets going to a specific socket. It was also more recently added to the secure computing (seccomp) subsystem, so you can actually put a BPF program into the kernel that will evaluate the system call requests made by a sandboxed process and decide whether those requests can proceed or not.

What has happened over the course of the last year is a whole bunch of development on BPF, called extended BPF. A lot of changes have been made to it. The original BPF virtual machine only had two registers, which is not a whole lot to work with; the new one has a bunch more, and they're 64-bit registers, along with a bunch of new instructions which somehow very closely match the x86-64 instruction set. The reason for that, of course, is so that you can write a just-in-time compiler that converts your BPF program into native machine code, gets rid of the interpretive loop in the kernel, and runs things faster. There's the ability to call kernel functions, in a very limited sort of way: a function has to actually be exported to BPF programs before they can call it; you can't just make arbitrary kernel calls, or that would get interesting. And there's a verifier built into the kernel now that will look over a BPF program and verify that it is safe for the kernel to run, to the point that you could actually allow unprivileged users to load programs into the kernel. It makes sure the program doesn't access any memory that it doesn't have access to, and that it has no loops so it can't run forever, that sort of thing. So that's in there, and if it actually works as intended, it will verify that programs are safe. There's a functionality called
maps, which are a sort of associative array allowing the sharing of data between BPF programs and either kernel or user space. All of this stuff has been merged at this point, and the BPF subsystem itself was moved out of the networking stack, where it was originally found, into the core kernel, and positioned as sort of the virtual machine for the kernel. So we'll see it move into the secure computing area soon, I believe, and into tracing filters as well; patches for both of those exist. I just saw a patch go by adding it to the traffic control subsystem within the networking stack, for doing policy and queuing types of things, which is something I hadn't seen before. There's also been talk of replacing nftables, the netfilter replacement, which has its very own virtual machine for packet filtering. That's not going to happen for a while, though, because those machines were designed with different sorts of use cases in mind, and it's not going to be an easy thing to do. But we may see some pressure to unify them at some point, because BPF is really becoming the standard in-kernel virtual machine, and I think over the coming years we're going to see some very interesting applications of this ability to program things within the kernel and add that sort of generalized functionality.

So that's moving stuff into the kernel. There are also people working on moving crazy things out of the kernel. Page fault handling, of course, comes into play when a process tries to access memory which is not actually resident in RAM; the kernel has to go find the contents of that memory, allocate a page, put the contents into it, slot it into the process's address space, and let the process continue. It seems like the quintessential kernel-oriented task, and not something you would want to do in user space. So the first question that comes up when somebody proposes this sort of thing is: why on earth would you move page fault handling into user space, where it's
going to be slow, and so on? And the answer here is, again, virtual machines; but now we're talking virtualization-type virtual machines: KVM, Xen, that sort of stuff. If you're running a big operation and you want to move a virtual machine from one physical host to another, there are a couple of ways you can do it. One is to simply shut the virtual machine down, copy everything over to the new host, and then fire it up over there. But this can take a while, multiple seconds, during which time the service provided by that virtual machine is not available; and that's often not desirable or even acceptable. An alternative is to move just the barest core of your machine to the new host and leave all the memory behind. As soon as that machine starts to run, it's immediately going to start page faulting, because none of its memory is there. So you push those page fault events out to user space, which can handle the network communication and so on to get that memory off of the old host machine, put it in place, tell the kernel where it is, and then allow the process to continue. This actually lets you keep a lot of unpleasant stuff out of the kernel that you would otherwise have in there, and allows for much faster virtual machine migration, by essentially doing demand paging across the net. So there's a patch set out there.
It's actually been out there for a couple of years, adding this functionality. There's a new madvise() operation to say that this region of memory is going to be handled in user space. You can call userfaultfd() to get a file descriptor on which you get notifications of page faults; you then fill a page with the right contents, use remap_anon_pages() to actually map that page into the faulting process's address space, and then write back to your userfaultfd, and the process continues. So everything goes, and you have the functionality you need. This is, again, used primarily with KVM and such.

Keith: How does this differ from just handling a SIGSEGV fault?

Well, you could do it in a SIGSEGV handler; in fact, the default mode of this gives you a SIGBUS if you're handling your own page faults. But this gives a nicer interface where a separate thread can handle these events, because you don't really want to be doing all this stuff in a signal handler; that's not an environment for that sort of thing. So this is mainly intended for multi-threaded programs, where you have one thread that can handle this sort of stuff while the other one just waits for it. So this is a big and invasive patch.
I expect it's going to take maybe another year or two to get in, because memory management stuff is like that, but the work is out there and I think we'll see it go in eventually.

And the final thing I'm going to talk about, another thing that some people think is crazy and other people don't, is live kernel patching: actually applying patches to your kernel while the kernel is running, without any kind of downtime or other issues like that. People want to do this because rebooting the system to apply security patches is a pain, and depending on what you're running and what your operational requirements are, it may not be an option at all; you really want to be able to do it on the fly. There's enough demand for this that there's not just one solution out there: I looked around and I found five. There are a lot of options to choose from here, but if you look at them, you can actually scratch a bunch off the list. KernelCare, as far as I can tell, is a purely proprietary sort of thing; its developers don't appear to have any interest in working with the mainline kernel, so it's not of interest for mainline kernel development. Similarly, Ksplice was once openly licensed but has since been taken proprietary as well, so take that off the list too; it's not an option for the mainline kernel, whether we wanted it or not.

At the bottom of the list, the work done at Parallels is actually sort of interesting. It's not live patching as such. What they do is use the checkpoint/restart mechanism to checkpoint every process on the system, then use the kexec system call to boot into an entirely new kernel. They don't patch the running kernel; they bring in a totally new kernel with the fix in it and boot into that, then use the restart mechanism to restart every process that was running and pick up where it was before. So it's kind of a nice idea.
You're not actually doing surgery on your kernel as it's running. On the other hand, the checkpoint/restart mechanism has its own limitations; it doesn't work with every workload out there, so it's not really a general solution that works for everybody, and Parallels hasn't really been pushing it. So I took that off the list as well, for now.

That leaves two: kpatch and kGraft, put out by Red Hat and SUSE, both of which of course have an interest in providing this sort of functionality to their customers. So each of them has developed a mechanism here. There's actually a lot in common between them, in that they both use the ftrace machinery to intercept calls to any function that is changed by a patch and divert those calls to the changed version of the function. But they differ in pretty much every other way; they took very different approaches to how they implement this.

One might think: okay, we've got two companies that are competing with each other; they each have their own thing they want to put in to differentiate their distribution, so they're not really going to want to cooperate on this, so maybe we should just merge them both. But of course that's not really an option; that's not how we try to do things in the kernel. We want to have one solution that everybody uses, that everybody maintains, and that works for everybody. The good news is that the developers have gotten together and actually agreed on a common base layer that they can both use, and they've put it out there.
They have it in a tree, and they intend to merge it in the 3.20 development cycle. Whether that will happen or not, I don't know, because it hasn't been through a whole lot of review on the kernel mailing list itself yet, and there are always surprises that can come up there. But if it doesn't get into 3.20, it will get in shortly after that, and then we will finally, after quite a few years, have a live patching capability in the mainline kernel itself. I think that's something a lot of people will find useful.

So, just to conclude: I've talked about some fairly crazy new things. Thinking about this in a more general sense, the problem with crazy new stuff is that people actually use it, even if we sometimes advise them not to. If people use it, they build programs and systems that depend on it, and then we get to a point where we have to support it forever, because we have a strict rule about not breaking user space in the kernel. This can be a problem, because we're not always all that good at designing new features or new APIs. The control group example, again, is a good one, but there are plenty of other things where we have put something in and then realized: actually, that was really dumb, we shouldn't have done it that way. But by that point people are using it, and we cannot break it, so you end up supporting things in a compatibility mode. If you do this for too long, you end up with a kernel that is full of compatibility code, and it becomes very hard to maintain going forward. That's a situation we really want to avoid. But it's hard, because it's really difficult to figure out what to do when, in fact, you're out there blazing trails.
You're doing things that haven't been done before. We're long past the days when we were working hard at re-implementing POSIX; we're doing things nobody has done before, so we don't know, and there is no agreed-upon way to do them. So we're going to run into surprises, and it's going to be a challenge. We've met this challenge so far for 27 years, and I expect we'll continue to for the coming years, but it's going to be interesting to watch how all this goes as we try to add all this new stuff to the kernel. And on that note, I am done. I think I have a few minutes for questions, if there are any. Raise your hand for the microphone.

Sort of related to that last item: it seems like, way back when, we used to get subsystems being completely tossed out and rewritten quite regularly, whereas these days, and I don't know whether it's cause or effect of the development cycle, we get much smaller, incremental work. Is it worth it? Could people realistically get a large subsystem rewrite in these days, and should they?

I didn't quite understand the latter part.

If someone came up tomorrow with a new VFS that was 30% faster on common workloads, but required rewriting 80% of the VFS subsystem, would that actually get merged these days?

So, if somebody came up with a new virtual file system layer?
Well, there is an example out there, in that somebody has come up with a new block I/O scheduler, below the VFS but a similar sort of thing, that seemingly performs quite a bit better than the completely fair queuing (CFQ) scheduler we have now. This has not been merged, because it seems like a duplication of the functionality we already have, so the plan is to try to evolve from what we have now toward where we're going. And that's generally what we try to do: not to replace subsystems wholesale, but to evolve them over a period of time, so that you always have something that works that then moves in the direction you want to go. It's been a long time since we replaced a subsystem wholesale. In 2.4.10, something like that, we replaced the virtual memory management subsystem, and that was traumatic, so we try not to do things that way. That sort of evolution can be hard, especially when somebody has come up with a solution that works now, and we say: well, we can't do it that way, you actually have to work on this grotty old code that you were trying to replace. But that really is what works best for us in the long term.

Okay, we have a question up there, if we can get a mic to them; last one.

You mentioned that we're accumulating APIs that we don't really want to support anymore, but we don't want to break user space...

There's a lot of echo down here; I don't hear it very well.

You said that we're accumulating APIs that weren't well written, but we don't want to break user space. Could we do what some people do, like a kernel 4.0, and just deprecate all of those?

Well, we do deprecate APIs, and we'll get to a point where we put in warnings saying: hey, this API is deprecated. But there's only so far we can go with that, because in the end, if removing something will break things, we won't take it out. There were a couple of examples of this in the 3.19
development cycle, one of which involved simply printing the BogoMIPS value, you know, the fake processor-speed value, in the CPU information file in /proc for the ARM architecture. They took it out because the number made no sense to them at that point; then somebody complained that it broke their program, and so it's going back in. So we try to do things, and we will force things out. The rule is not that we can't remove APIs; the rule is that we can't break user space, which is actually a bit of a different rule. So with enough nudging and pushing you can eventually get people there, but it takes years; it takes like five or ten years, at best, to do that sort of thing. And we had a question... oh. Well, I'm told we're over time. So I'll be around; you can go ahead and find me, and I'll be happy to talk to you afterwards. Thank you.