Afternoon all. Most of you probably already know me. My name is Dave Chinner. I work for Red Hat. I'm the XFS maintainer, so I generally work with file systems, and I've done enough file system talks. This thing on needs to be higher up. Try that. Can people hear me now? That better? Okay. Do I need to start again? Yeah. So normally I talk about file systems, but I decided I didn't really want to talk about file systems this time. I was looking for some kind of inspiration when the call for papers was out. I didn't really know what I was going to talk about, but a talk for me generally starts with a flame war: a bit of a rant, someone complaining, a bit of shouting backwards and forwards, maybe an argument or two. So we've got to have something like that. It would have been mid-June, something like that, when this came along on the Linux kernel mailing list: a patch. Somebody said they reviewed it, and the reply from the author was, well, I'm not looking for reviews, I'm actually looking for testers. That didn't really sit well with me. Discussion ensued, a few things were thrown backwards and forwards, and I got to the point where I decided that I didn't have time for a flame war right then. And then I remembered the call for papers was out. So let's start with a few simple definitions. We can debate them, but I'll just take some simple ones. Computer science, when you look at it, is the theory and process behind what we do: provably robust logic constructions, verification techniques for your code, languages to express those logic structures and techniques. It's the theory behind it. It's the science behind it. Software engineering (you can see where this is going now, can't you?) is the process of producing software that provably meets end user requirements. It's making working software. It's building something. It's engineering. It's not science. Programming.
It's something we all do: we produce code. And of course there's a programmer: somebody who programs. They're nice, simple definitions; keep them in mind. Okay, so there's an elephant in the room, isn't there? Programming is what we do. We get paid to do this, and we do it because we enjoy it. Why on earth would anybody consider it harmful? It's really essential. So where am I coming from here? This is the interesting part, isn't it? Put enough monkeys together and you get a self-extracting, XOR-encrypted shell script that contains the works of Shakespeare. It's bound to happen, but I can't help but think there's something wrong with that. Code monkeys are an appealing concept, but I don't really like having to deal with code monkeys, and I don't think any of you really want to be a code monkey. You want to actually be doing something other than just randomly hammering on code and hoping that it comes out right. Because if you just hope it comes out right, you end up with something like this. Can anybody work out what on earth that code is doing? I can see by the stunned looks in the audience that nobody has got any clue. I still can't work out what it's supposed to be doing. It's a mess of ternary operators and so on. So is that a work of genius? A "Real Programmer": a sub-variety of hacker, one possessed of a flippant attitude toward complexity that is arrogant even when justified by experience. Real programmers terrify the crap out of other programmers, because someday someone else might have to try to understand their code in order to change it. Now, if you had that code, would it terrify you if you had to maintain it? So it could have been genius. But think about it: the competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague.
Nesting ternary operators is something you can do, but it's not very nice. It's a clever trick. It's a hack. It's a "look how smart I am" kind of thing. On the other hand, if we look at a few simple things, perhaps there's another reason for it. A person needs to be sufficiently skilled in an art to be able to understand that they aren't very skilled at all. If you are unskilled, you might know just enough to be dangerous, but not enough to know that you are dangerous. You could be a risk to other people. You need to learn the fundamentals of something to realize how little you understand about that topic. So that code: maybe it was just somebody who didn't understand what they were doing. There are different reasons that could be behind it. We don't know which it is, but it could be a work of absolute genius by somebody who doesn't care about complexity because they're amazing, or it could be somebody who doesn't know any better. So let's take a recent example that some of you may have seen on the Linux kernel mailing list. I hear laughter, so some people know exactly who I'm talking about. Someone showed up who wanted to help, and showed all sorts of enthusiasm for helping us make the Linux kernel better. They wanted to help. So they started looking through the code and noticed that there were lots of FIXMEs all through the Linux kernel code, and they wanted to help by fixing the FIXMEs, because that's the obvious place to start, right? Well, no. But the thing is, lots of help was actually given by experienced kernel developers: Ted Ts'o, Steven Rostedt, Chris Mason, Joe Perches, Randy Dunlap. They all spent a good amount of time trying to help this enthusiastic young man do the right thing: to learn the process, to do things somewhat safely, or at least where his destructive capabilities would have limited impact on everyone else.
So we gave him documentation to read: how to test and validate patches, how to send them properly so that they didn't make your eyes bleed and would actually apply, and also why just removing FIXMEs was a bad approach to fixing them. The problem was that he kept making the same mistakes. The patches were still whitespace-mangled or just wouldn't apply. He didn't go and read the things that he was asked to read. None of the help, advice and teaching that was given to him (and there was a lot, and there was a lot of time expended) made any difference. What I'm trying to say here is that programmers who don't know their limitations affect the collective productivity of a project. And by affect, I mean screw up. We waste time if people do not have the capability to learn. So there's a lesson in that: to become competent in any undertaking, the ability to learn is of primary importance. I mean, you're all sitting here now; what are you doing? Hopefully you're learning. That's why we're here at the conference. We're not just here to have a good time. We're here, hopefully, to learn something and apply it to our day-to-day lives. I hope. There's a lesson in that too: learning is hard. Knowledge is hard. From Elon Musk and his recent Reddit Ask Me Anything: he was asked how he was such a smart guy who remembered everything, all this complex physics and computing and so on. His answer was one bit of advice: it is important to view knowledge as a sort of semantic tree. Make sure you understand the fundamental principles, i.e. the trunk and the big branches, before you get into the leaves and details, or there's nothing for them to hang onto. Knowledge is structured. It comes back to the Dunning-Kruger effect, doesn't it? If the trunk is not there, the leaves just fall off, and you don't even know the trunk should have been there. So this is what my dad said to me when I was 8 or 9 years old.
I was trying to do something that I had no idea I couldn't do. "Don't try to keep the information in your head. You'll never fit everything you need to know in there. Instead, I only keep an index in my head of where the information I need lives, so that I can find it quickly when I actually need it. So now you know my secret." Trying to learn details and keep all of those details in your head does not make you smart. You might be able to recite them, but it doesn't mean that you understand them. You need to know where to find the details, not the details themselves. So if somebody asks me about some corner-case behaviour in XFS, my first answer is almost always "hold on, let me go and look at the code". But I'll find the code that I need to look at in a few seconds and have an answer in a minute, because I know where to find the answer. It's not up here. I've got terabytes of storage full of information, and I've got all of Google. You don't have to keep all the information in your head to be smart. You just need to know where to find it. And if we're talking about code: if you're not using cscope, you should be. So let's apply this to programming. This is something that Dijkstra said a long time ago about the reliability of the programs we create: reliability concerns force us to restrict ourselves to intellectually manageable programs. Intellectually manageable. That means it fits inside your head. You can think about it. You can turn it over. You can keep it as a whole, as a structured tree of knowledge with all the leaves of detail in it. And that's how you end up with reliability: you can think about it as a whole, and you don't miss a single piece of it. So with that in mind, let's move on to another quote from Dijkstra and approach this from a different direction. This was one of the things that was quoted at me in amongst the flame war that originally motivated this talk.
I initially read this one way, and it took someone smarter than me to point out that there's another way of looking at it: "If you carefully read its literature and analyse what its devotees actually do, you will discover that software engineering has accepted as its charter 'How to program if you cannot.'" So how do you read that? How do you interpret it? That software engineering is a crutch for incompetent programmers? It wasn't said to me in a particularly nice manner, so that's how I interpreted it. I didn't reply to it. I mentioned it to someone else (I'm not sure whether Neil's in the room here; no, probably not), and he turned it around the other way. What Dijkstra is actually saying is that humans cannot program. Competent programmers, engineers, know their limitations, and hence understand that they need a larger framework for projects they can't fit entirely in their skulls. Think about it. A bit of context, again from Dijkstra: programming is one of the most difficult branches of applied mathematics; the poorer mathematicians had better remain pure mathematicians. How many people here are trained mathematicians? Oh, we have a couple. Yeah, a few. How many of you think you're programmers? Okay, everyone here writes code, though. This is where we're starting to go: there are very few people capable of understanding and applying the mathematics required to design and prove a program bug-free. So where do these mathematicians sit on the scale? Well, according to Dijkstra, somewhere off that side of the mathematicians. There are not very many mathematicians that can deal with the complexities of software. Forget about your biologists and chemists, and I think the software engineers are somewhere on the other side of the sociologists. So how good are your maths skills?
So if we're talking about maths, and using maths to prove whether a program is bug-free, and your maths isn't very good, how do you prove that your program works? Well, you run tests, don't you? But this is what we hear about tests: program testing can be used very effectively to show the presence of bugs, but never to show their absence. You can't use testing to prove that your code does not have any bugs in it. You can show that it mostly does what it's supposed to do, but you can never prove that it's bug-free. You can't prove that it's perfect. And then: if you want more effective programmers, you will discover that they should not waste their time debugging; they should not introduce the bugs to start with. So we've got a bit of a dichotomy here. We're not very good at maths, but we're being told that we need to be good at maths so we don't create programs with bugs. And we have to be good enough that we don't introduce bugs, because debugging is a waste of time. I think everybody here spends more time debugging their code than they do actually writing it. I know I certainly do. I'm not a very good programmer. What it comes down to, though, is that bugs are a reality in what we do. We are not very good programmers. We're not very good mathematicians. And computer science really has not advanced sufficiently to provide us, the incompetent, with the tools that we need to prove that complex software is perfect. If we want to consider debugging a waste of time, then we need better tools so that we don't create the bugs in the first place. But we don't have the mathematical background or capability, and we don't have the tools. It's not a good look. The conclusion I draw from this is that there are no programmers capable of writing bug-free software at typical modern program scale. Sure, anybody could write a bug-free hello world program.
Well, assuming there are no bugs in the C library you linked to, or the Perl interpreter you're using, but we'll just ignore that for the moment. Think about yesterday's presentation about the seL4 microkernel. It was 10,000 lines of code that they've mathematically proven to be correct. It took them 10 years of work. And the worst part is that the verification logic is a couple of orders of magnitude more complex than the code itself. That's only 10,000 lines of code. How big is the Linux kernel? We're almost up to 20 million lines of code. So if 10,000 lines of code takes 10 years, a year for every thousand lines of code, we're going to be here until the heat death of the universe. Though if mathematics and computer science get better, it might be a bit faster. Andrew Carroll, who just gave the talk in this room about metrics and Ceph and so on, mentioned yesterday that he'd read a paper from a conference on formal verification techniques earlier in the year. The paper estimated the pool of mathematicians capable of proving software bug-free to be around 1,000 people worldwide, and 100 of them were at that conference. There aren't enough people who can do this for the Linux kernel, or for pretty much any open source software you care to name. We simply don't have the mathematical capability. And so it's Paul's friends that need to save us. Yeah. So perhaps we need to take a different tack. Okay, let's go back to some code. This came from Andrew Morton, roughly around the same time I put the call for papers in. He was reviewing a fix to the memory management code, to the page list scanning in page reclaim. He couldn't work out what was going on. He couldn't work out what the code was supposed to do, or what bug was supposed to be fixed. First of all, he calls the code absolutely awful.
And he says that he's reduced to trying to reverse-engineer the intent from the implementation. It's not very good code when you're at that point. So what we've seen here is that somebody hasn't written the code very well (not a very good programmer), or it's a bunch of people who, as a collective, aren't very good programmers. This is not uncommon. It actually happens quite a lot. So what did Andrew say to fix this? Someone please prepare a patch which fully documents the design. Let's get down and review that. Once that patch is complete, then let's start looking at the implementation. What he's basically said is: let's go back to first principles to fix the code. Design review, design documentation. Where does that come from? What process gives us that? It's not programming. It's software engineering. To fix bad code, we need to engineer solutions; we don't program our way out of bad code. What it comes down to is that if you can't understand the design of the code, then you aren't capable of proving that any specific change to that code is fit for purpose: that you have actually fixed the bug you intended to fix, or correctly introduced the feature you intended to introduce. You have to understand the code before you can change it in any way. A different take on this comes from Ken Thompson, one of the original Plan 9 designers and implementers: you can't trust code that you did not totally create yourself. How many people here work on code that other people have written? Just about everyone. So what Ken is telling us is that you should not trust code that anybody else has written. I don't even trust the code I write, come on. But if we put it in the context of everything else that we've just learned, we reach a kind of startling conclusion. According to the accepted wisdom, human programming capability doesn't scale in a trustworthy manner beyond one brain.
And that assumes the brain in question can fit the entire problem space in it. My brain is not big enough for the Linux kernel. It's not big enough for most of the things that I work on. I need help, because programming as an exercise, as a pursuit, is not a viable method of developing large software projects. We simply don't have the capability to comprehend the complexity of the systems that we're working on, and as we well know, Linux fits into that category. No one person holds everything in their head. We need some way of dealing with this problem, and Linux, as a project, does have one. What we've got is essentially a hierarchy of diverse knowledge and skill, coordinated through the communication channels that we all use, with fairly well-known, constantly evolving processes. We have some kind of external structure around the code. A bit of guru meditation, perhaps. You sit down and you talk. Gurus talk to sub-gurus, sub-gurus talk to gurus. You've got gurus in training. You have productive minions who produce all this code that the gurus look at and bless and pass on up to the guru at the top, who's sitting in the corner there. And of course, right at the bottom, you've got to have minions in training. Who's going to replace the gurus that we have now? They've got to come from somewhere. And as the earlier example showed, the gurus at the top are even trying to train the minions right at the bottom as they first come in: to learn the processes, to learn the habits and the techniques, to build that structured knowledge tree in their brain that will enable them to reach the top of the tree. What it comes down to is that in any large project, it's not programming that matters. It's knowledge transfer. It's the most critical process in any large project. You can't sustain a large project by programming alone.
You have to be able to communicate what's in your head to the next person, because what's in their head is different. It's not the same as what's in your head. That structured knowledge tree has a different structure in their head; it might contain different leaves and branches, and it might be connected up completely differently. But if you don't try to communicate and transfer what you know to them, and vice versa, the two things you work on will not work together. And if they're part of the same project, then that project is not going to succeed. You have to be able to talk, communicate and transfer the knowledge that you have, from top to bottom, from left to right, to anyone in the project. And don't discriminate just because you don't know who a person is and you've never seen them before: you have to assume competence until it's been demonstrated that there isn't any. A classic case of knowledge transfer that we have in the kernel: memory-barriers.txt. Now, this started off as just a simple matter of documenting some locks and a couple of SMP read barriers and write barriers. But now, as Mel so eloquently put it, if Documentation/memory-barriers.txt could not be used to frighten small children before, it certainly can now. It frightens me. Paul is mostly responsible for that document, and I... pardon? Paul says thank you. But I really don't want to know what the inside of his skull looks like, because that's the amount of knowledge his head contains on concurrent programming and so on. I'm glad that he is the expert on that stuff and that we have him in this project. There you go: he says he used the same trick my dad did. The same trick that I use. There's a structured knowledge tree, and you know where to find the details. But here's the scary thing, as Jonathan over here said in response to the same thread.
There may come a point in time when understanding memory-barriers.txt will be mandatory for working in much of the kernel. That is a scary thought, because that is a horrifically complex set of semantics, and if we all have to code around it, I'm going to go sit on the beach and drink beer. But what it comes down to, once again, is that that stuff doesn't fit in my skull, just like all of XFS doesn't fit in Paul's skull. We have to trust each other that what he's putting in that document, what's in his skull, is correct and works. Because if it doesn't, all of the stuff in XFS that uses that locking, the RCU, just breaks. You have to trust other people and the knowledge they apply to the project you're working on, even more than you need to trust the code they write, because it's the knowledge they apply that you then use in your own code. That's what trust is. So with that in mind, consider again Andrew Morton's comments about that patch, where he couldn't understand what bug was being fixed. He couldn't understand the code. If a guru needs to reverse-engineer the code to understand its original intent, then we have a failure of process and a failure of knowledge transfer. The code wasn't documented well enough. There was no design. Where did that code actually come from? Well, in a way you could say that programming is at fault there, because that particular code was originally the result of a forklift upgrade of the memory management subsystem. Back in 2.4.10, Linus replaced the whole VM with one from Andrea (I never remember his last name). It all came out of his head. One programmer knew the whole problem space. I don't know how he kept it in his head, to tell you the truth. But he solved the problems. Not everybody else understood it. And over time, with all of the other heads bashing against that code and that design, it's morphed. It's grown organically.
But it's never been documented what it was originally supposed to do. There may have been a few people who understood it at the time, but that was many, many years ago. And without decent documentation, we slap a band-aid here, we slap a band-aid there, and all of a sudden nobody knows how it works. That's not so much a failing of our process as a failing of our knowledge transfer. The process needs to enable that transfer, and enable it such that you can understand the code you've written in five years' time, and somebody else can understand that code in five years' time without picking your brain. So let's go back to the original quote, full circle. This, as far as I'm concerned, embodies a fundamental misunderstanding of the development process we use. When the process that we use is not understood, the process that brings all of our skulls together, when we don't do that, what do we fall back on? We try to fit it in our own skull, and we program. And so we don't understand the whole problem. What do we get? We get bugs. Single brains aren't sufficient to keep our code in good working order. We try to avoid this, and so we have a reviewer's statement of oversight. One of the things you are asserting when you say "I've reviewed the code", Reviewed-by: Dave Chinner <david@fromorbit.com>, is that the patch is free of known issues which would argue against its inclusion. So if you look at a piece of code and you say you've reviewed it: did you compile it? Did you do basic checks in a test environment? How can you say that a patch of any significance is free from known issues if you haven't actually executed that code in a test environment? Having executed the code implies that you've actually compiled it as well.
But people submit patches that don't compile, and sometimes, because they're big, long patches, you don't notice that there's a missing semicolon somewhere that causes the build to fail. Reading the code usually isn't sufficient to determine that the patch is ready for inclusion. A thought review like that requires the ability to fit the entire problem space in your head and fully understand it, and we've already established that we can't fit complex code in our heads. Programming requires you to fit things in your head, so you can start to see where we're going here: if we can't keep the problem space in our heads, it's naive to think that we can do the review in our heads. A thought review also doesn't involve the reviewer actually reproducing or verifying the results claimed by the proposed change. There's an important word in there: verification. And just looking at the code doesn't take into account the wider context of the problem being solved, either. There is more than just code that we need to look at in a review. So, again from Andrew: any question which a reviewer asks should be viewed as a defect in the patch. The patch isn't finished until people can read it without having questions. If somebody makes a change and you're sitting there scratching your head saying, "I'm not sure why you're doing that", then it clearly hasn't passed review. Review is more than code. Does the description of the change adequately describe the problem being solved, the bug that is occurring? Does it explain why the change needs to be made, and why it needs to be done that way and not some other way? Is it the only solution? Maybe, maybe not. The reviewer will ask, "why didn't you do it this way?" There needs to be an answer to that. The commit description in the patch should have all of this in it, so it's there for the reviewer and they don't have to ask questions.
So when somebody comes along three or four years later, scratching their head: "I wonder why this code's like this. Oh, hold on, I'll go and look up the change history. git blame. Oh, that one there. Oh, that was the problem." How many times have you wished that your commit message was better than "fixed problem"? What problem? Pardon? "What commit message?" That's simply not acceptable in a patch. Sometimes patches are simple enough that a single-line description is sufficient, but anything changing more than a couple of lines of code is going to require more, to describe what problem it was solving and why it needed to be solved. We might also need documentation for it. If it's an API change, maybe there are man pages that need to be fixed up to go along with it, or maybe there's something in the Documentation directory that needs to be updated. There could be a whole heap of things that lie outside the code change itself. And the thing is, none of these are part of programming. They have nothing to do with the code change being made. They have everything to do with the context in which it's being made, and the processes we go through to decide whether the change is good or not, what risks we need to take into account when making it, and so on. The moment I say "risks", we're well outside of programming; we're definitely into the engineering domain. Engineering is all about risk management. And if you don't know why, what, how and whether it works, you can't really do any sort of risk analysis to determine whether we should take the fix or not. Is taking the fix more risky than leaving the code as it was? It's a trade-off that you have to decide, because we might have a bug that only one machine in 10 million hits once a year, but we suspect that the fix for that bug is going to make every other machine a whole lot less stable. Is that a good fix to take?
It's a risk analysis process. It's something you've got to think about, and it's not ever considered in any programming or computer science textbook. So to look at this problem as a programming problem is to miss a whole lot of what we do and what we should be doing. Design, build, break, repeat. We have to assume that the design and the implementation will break at some point, simply because we cannot keep everything in our heads. Programming requires the end result to be perfect, and that we can prove it's not going to break. But we can't do that. Programming also assumes the hardware is not going to fail, which we can't assume either. So really, when you look at it, it comes down to a simple thing: requiring perfection where perfection is not possible, or not required, is actively harmful. And that's basically why the talk is titled "Programming Considered Harmful". We cannot reach perfection. Thank you.