Okay, it is time, so I am very, very proud to introduce two very, very good friends of mine, Sergey and Meredith. Guys, give them a huge round of applause. Thank you so much. So this talk is about the science of insecurity, by which I mean the entire spectrum of exploits and vulnerabilities considered as a systematic, repeatable, and most importantly predictive mathematical model. We're going to examine from first principles what it is about exploits that makes them exploits in the first place, and how we can use this systematic understanding to design and implement software in which, to borrow a turn of phrase from Dan Kaminsky, entire classes of bugs simply don't exist. But before I get going, I want to remark on the other talk this Congress that focuses on Turing machines: Cory Doctorow's talk yesterday on the coming war on general computation. You're going to hear a lot in the next hour about certain hazards of Turing-complete protocols, and I need to make clear that what I'm inveighing against is Turing-machine computational power in very specific places, namely the communication boundaries between Turing-complete systems. Your CPU needs to be able to perform arbitrary computation. ICMP Echo does not. So that's an important distinction, and please keep it in mind. But more important than that are Cory's spot-on observations about how the sausage gets made, how lawmakers and vendors conspire to herd users into walled gardens where, oh, by the way, the folks doing the herding can lock out competitors and bleed those users to their heart's content. It's the oldest game in the book, and it's already underway in the United States. Right now there's an initiative under development called NSTIC, the National Strategy for Trusted Identities in Cyberspace, which is really just the old carrot-and-stick game aimed at conning citizens into voluntarily giving up any possibility of anonymity online, which is really the same as free speech online when you get right down to it, by holding out the carrot of quote-unquote safe walled gardens like the iOS App Store and social networks like Google+, where the price of admission is your offline identity, backed up with a stick made from the specter of spam and malware and evil cybercriminals on the filthy, nasty internet. And no matter how rotten the carrot really is, the thing about human psychology is that once someone has bought into lofty promises about matters such as security, it becomes really hard to convince them that the carrot cake is a lie. So our other option is to break the stick, and that's what we're going to talk about today. Now, I actually owe this observation to my husband, Len Sassaman, who passed away back in July. Len started out in this world as an anonymity and privacy researcher; he was working on that at KU Leuven under Bart Preneel. But in 2009 he shifted focus to the language-theoretic security work that I had started back in 2005, because he became convinced that the future of an open internet is completely dependent on us smoothing out the attack surface and taking away the ability of repressive governments to hold this threat over our heads. All right, so if all you take away from this talk is this, I'm going to be really, really happy. First of all, the vast majority of vulnerabilities come from protocol and message format designs that require you to solve problems that are really not solvable if you want to process them securely. You know, putting yourself into a catch-22 that you cannot escape from.
When you try to set yourself up against the laws of physics, you fail. End of story. But the good news is that from a design perspective, we can route around this. We just have to think about how we design inputs and outputs in terms of formal languages and conduct our implementation accordingly. So we have been living on borrowed time. When the NSA tells you that there's no such thing as secure anymore, they're probably not kidding. About this time last year, Brian Snow was saying there's a security meltdown coming, and I think the last year pretty well bears out what he had to say. And this is not for lack of trying. I mean, there have been trustworthy computing initiatives dating back decades. If you go to any bookstore, you're going to see scads of secure coding books on the shelves. There are dozens of conferences dedicated to secure software engineering and theoretical security. So much ink has been spilled on this, so many bits have been spilled on this, and software still sucks ass. And don't even get me started on hardware. We have the internet we have, and how did we get here? Is everyone really just that incompetent, with no idea what they're doing, or is something else at work? Well, it's a little column A, a little column B. So what is it that we're doing wrong? Okay, I mentioned the laws of physics earlier, but really our physicists are people like Bertrand Russell, Kurt Gödel, and Alan Turing, the guys who tried to universalize mathematics from axioms. Now, the problem that we as security researchers want to solve is more focused than the general problem that they wanted to solve, but we start in the same way, which is by formalizing the question: what is insecurity? Is it holes for sneaking in executable code? I mean, that tends to be how people look at it, but that's looking at everything in isolation. Furthermore, you don't necessarily have to have an obvious hole in order to execute, say, a return-oriented programming attack. Memory corruption, buffer overflows, in-band signaling like Travis Goodspeed's packet-in-packet stuff, capabilities issues, access control issues, all of the above. None of that is really sufficiently general and also sufficiently descriptive. I come from linguistics as a background, and we have this concept in linguistics of elegance, where an elegant description of a language is one that generates exactly the sentences that belong in the language, and none of the ones that don't. Now, if you think about those possible causes of insecurity that we just looked at, Wikipedia is not much better. This laundry list here is just all over the map. And if the classification looks arbitrary, what this is really telling us, and this is a lesson from the natural philosophers here, is that we really don't understand the structure and the common origin of the phenomena that we're seeing. So Jorge Luis Borges has this great description of the classification of animals: those belonging to the emperor, those that are embalmed, tame ones, suckling pigs, sirens, fabulous ones, stray dogs, those that are included in the present classification, and bonus points for anybody who gets what paradox that is. You know, again, there's no system to this, there's no rhyme or reason to it.
What we need is a way to go from the arbitrary Lamarckian classification, where whales and hoofed mammals are sitting there in the same clade under reptiles of all damn things, to an understanding like Watson and Crick's understanding of DNA, which has led us to the science of cladistics, where we treat species as classes based on common descent. It's true that we can and should classify exploits by similarity, but if we only look at the surface similarity and not the underlying structural similarity, we're setting ourselves up for failure. All right, so what does trustworthiness actually mean in a computing system? Well, Tony Hoare, the guy who came up with a little algorithm called Quicksort a while back, remarked in a paper in the 1960s that a program is correct if and only if it handles exactly the inputs it's supposed to handle and produces exactly the outputs it's supposed to produce, and nothing else. We're not that good at this. We're not that good at deciding whether an input is valid or malicious and rejecting the malicious ones. We don't have a good idea of how to trust software not to do certain things, because how do you predict the unpredictable? Exploitation is simply unexpected computation caused by crafted inputs. So when you're asking the question, is it a good input or is it a bad input, well, this is not just a theoretical problem. We've been talking about this in computability theory for fifty-odd years now. We have this concept of undecidable problems, problems that you cannot solve regardless of how much computational power you have, and some undecidable problems have to do with recognition of inputs. No general algorithm for solving these problems exists. So the networked world that we live in is a composed world. You have individual components that must accept or reject inputs and then generate some outputs, and these components communicate, and it's necessary for these messages to be interpreted identically by the endpoints that send and receive them, because if Alice thinks she's telling Bob "you need to go to the store" and Bob hears "you need to go to the bus station," Alice isn't going to get whatever she asked Bob to get her. So we're going to talk today about how the halting problem arises in both the single-component case and in the distributed case. All right, so how does undecidability apply here? Well, in the single-component case, some protocol formats are sufficiently complex that simply being able to discern a good input from a bad input is undecidable, and furthermore, some protocols are so complex that determining whether implementation A and implementation B handle them in the same way is also undecidable. All right, so we talk about recognizers as algorithms that examine an input string and determine whether that input string belongs to the language or not. For sub-Turing classes of languages, and we'll get into what those are in just a minute, the answer is either yes or no; for Turing-complete languages, the halting problem says the answer is yes, no, or maybe. And if the answer is maybe, you're never going to know, because the algorithm's never going to stop executing. So when input recognition fails, the code on the inside of your protocol, the state machine of your protocol, is going to receive something it wasn't expecting.
Any primitives that it exposes can essentially be programmed with input from an attacker to trigger memory corruption or implicit data flows, and a weird machine emerges. Halvar Flake coined that term to describe a more powerful execution environment than is intended, which, oh, by the way, you can program. So where do they show up? Well, the vast majority of programs that handle inputs scatter their validation checks throughout the entire program. The one exception to this tends to be compilers. And when your checks are scattered throughout the program, it's literally just like you took your 12-gauge and fired a load of double-ought at your code, and wherever the pellets landed, that's where your holes are. This is a horrible design pattern. Vaguely understood input languages are the mother of 0-day. And a weird machine is born; an adorable little root shell just pops right out. So as Halvar said, exploitation is setting up, instantiating, and programming a weird machine. One of those holes that the programmer inserted when he aimed his shotgun gets blitzed by some crafted input, and the states that the internals of the code enter were not intended by the programmer, but you know what, they're there anyway. So if we're going to understand this systematically, we have to look at it from the basics of computation as described originally by Alan Turing and Alonzo Church, but we have to think about it with regard to the lessons that we've learned from exploit programming. People like me study models of computation. People like Halvar and FX and all of our brilliant exploit guys study the actual computational limits of real systems, and the great part is we meet in the middle. So the Turing machine was the mathematical model that Alan Turing came up with to study the limits of what it is possible for a computation engine to perform. And he was able to formalize a class of problems that no Turing machine can solve. The initial example problem that he described is known as the halting problem, because it turns on the question of whether a given Turing machine will halt, which is to say return yes or no, or end up in that infinite-loop maybe state where you never get an answer back. Because even if the answer is yes or no, you have no idea how long you're going to have to wait. And if the answer is maybe, that answer takes forever, but you're not going to know. So if you say, I can take a universal Turing machine, execute another Turing machine on it, and have that universal Turing machine decide whether the input will ever terminate: fail, does not happen. If someone tries to convince you that they can do that, they are either stupid or malicious. Unfortunately, some designs, especially in the composed distributed-system model where you've got two different implementations, end up reducing to an undecidable problem, which means that there really is no way to fix that problem as it exists. We can avoid it, but fixing it requires a redesign. There is no 80-20 engineering solution for the halting problem. If somebody is trying to sell this to you, then run away. If someone bought it on your behalf, fire them, because you don't want to be around when it breaks. And this is counterintuitive. We're used to seeing more effort improve the results, but at the end of the day, you don't want to be the person who ends up slaving away for the rest of their life, pouring more and more and more effort down a bottomless hole.
Okay, so a little bit more about the history of halting, just to give you an idea where we're coming from. Back in the 17th century, a guy named Leibniz, you might know him as one of the inventors of calculus, asked the question: is it possible for a machine to determine whether an arbitrary mathematical statement is true? And he kind of worked on that in isolation during his life; people thought he was a bit of a nut. And then Hilbert posed this again in 1928 in a list of problems that he believed, if we could answer all of them, would give us a complete understanding of mathematics, and we could call mathematics done and move on. So Church and Turing work on this independently for a while, and both of them come up with the answer: nope, sorry, it doesn't work. Another interesting thing that falls out of this is a thing called the Curry-Howard correspondence, but I'll talk about that in just a second. Conceptually, it's actually easier to think about the halting problem in terms of a thing called Russell's Paradox. So back on November 2nd, there was a general strike in Oakland and elsewhere. And Quinn Norton, who's embedded with Occupy right now, writing for Wired, says: huh, well, you know, I want to go to the general strike, but if I go, I'm going to be working. Oops. This is Russell's Paradox in a nutshell. The normal way this is phrased is also called the Barber Paradox, where you've got a village where the barber shaves every man who doesn't shave himself. So who shaves the barber? As a funny side note, by the way, the smart-ass answer to this one is: well, let the barber be a woman. And that's actually not such a wise-ass answer after all; it's actually pretty clever, because it means you're introducing types, and that actually renders it decidable. If you're into that kind of type theory, come talk to me afterwards, because I want to meet you. All right. So I mentioned the Curry-Howard correspondence earlier. Programs are proofs, and vice versa. This means that an exploit is also a proof, and it's the best kind of proof, because it's a proof by construction, and those are really fucking easy to understand. We're working on establishing a formal duality here so that we can convince the academics of it, but part of why I love proofs by construction is that you can basically just look at them and go, oh, that's obvious. It's easy to show people how that works. All right. So as we said earlier, inputs are a language. Some languages are harder to recognize than others, and for some of them, recognition is undecidable. So what kinds of languages do we even have? Well, there's a hierarchy; Noam Chomsky came up with it. The simplest class of languages are the regular languages. You might also know them as regular expressions or finite state automata. Then there are the context-free languages, the context-sensitive languages, and the recursively enumerable languages, which are the ones equivalent to Turing machines. These categories are hierarchical, and different categories have different properties that we can use to our benefit. So finite state machines can only handle very simple structure: they can use delimiters, they can have repetition, but no recursive nesting. If you look at that graph up there, you start at the start state, and as you accept characters from an input string, you move from one state to another. And once you've finished all those transitions and consumed all your input, if you're in state seven, then you're good: you've matched the input to your language. If not, it rejects.
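As a concrete illustration, here is a minimal sketch of such a recognizer in Python, for a hypothetical toy regular language, (ab)+, rather than the one on the slide; the state names and transition table are invented for illustration.

```python
# A minimal sketch of a table-driven finite state machine that recognizes
# the toy regular language (ab)+ -- e.g. "ab", "abab". The states and
# transitions are hypothetical, just to illustrate the idea.

TRANSITIONS = {
    ("start", "a"): "saw_a",
    ("saw_a", "b"): "accept",
    ("accept", "a"): "saw_a",
}
ACCEPTING = {"accept"}

def recognize(s: str) -> bool:
    state = "start"
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition for this character: reject
            return False
    return state in ACCEPTING      # accept only if we ended in an accepting state

assert recognize("abab") and not recognize("aba")
```

Note that the machine needs no memory beyond its current state: that is exactly what makes the language regular, and exactly why this class of recognizer cannot count nesting depth.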
Trying to match recursively nested structures with regular expressions fails, which is why, when you try to use regular expressions to parse HTML, Zalgo comes out of the walls and eats your soul. If you want to match recursively nested structures, you can do this with what's called a pushdown automaton, which you obtain by taking a finite state machine and adding a stack to it. Then you can balance parentheses to arbitrary depth and everything's golden. But there are other properties that we see in protocols that don't fall into this category. If you have a protocol with a length field in it, like, oh, I don't know, HTTP, TCP, pretty much every protocol out on the Internet, with the exception of ATM, that one's regular, which is kind of awesome, those require a context-sensitive grammar. If there is some metadata that is necessary in order to interpret the rest of the data, that's when you're context-sensitive. And finally, you're Turing-complete if you are asking: is this input some program that produces some given result? That's undecidable. A guy named Rice formalized this in what's called Rice's theorem, which reduces to the halting problem. All right, so our language hierarchy again. What this tells us is we need to stop weird machines, and we can do this by constraining input language strength to context-free or regular. Is this all about parser bugs? Well, no, but that is an awful lot of it. If you are a component of a program and you receive an input from something, you have a recognizer, whether it was formally written as a recognizer or not. It could be a shotgun parser, and in that case, it sucks to be you, because if the recognizer doesn't actually match the language, then it's broken. And if the language itself is not well defined, well understood, or fully specified, then again, the program is broken. And when I say languages, I mean quite literally every formal system that we use in software. I'm talking about network stacks, because the format of a packet is a language. I'm talking about web servers, SQL servers, et cetera, because their requests form a language. But it gets better than that, because even just on a single system, your memory manager, the heap that assigns memory, that's a language. The call graph is a language, and that one likes to be context-sensitive, it really does. And most of these language recognizers are hand-rolled, and that's bad. Implicit recognizers are bad recognizers, because you cannot efficiently and effectively test or debug them. When you're intermingling your recognition and your processing, like if you recognize just part of an input string and do something based on that, and then recognize another part of it and do something else based on that, you're going to end up with data flows and transitions that give you a weird machine, running on state that it borrowed from the rest of the program. So, if you haven't recognized it, don't goddamn process it. If you have not confirmed that an input string is actually a member of the input language you want to process, and you try to process it as one anyway, you open yourself up to letting an attacker program the weird machine that's lurking inside your implementation.
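Here is what that pattern looks like in miniature: a minimal Python sketch, assuming a hypothetical toy language of nested groups of digits, with invented grammar and names. Recognition runs to completion and produces a parse tree before any processing touches the data, and the recursion supplies the pushdown automaton's stack.

```python
# "Full recognition before processing", sketched for a hypothetical toy
# language of nested groups, e.g. "(1(2 3)4)":
#
#   group ::= "(" item* ")"
#   item  ::= group | number
#
# Matching the nesting needs pushdown power -- here, the call stack.

def parse_group(s, i):
    """Recognize one group starting at s[i]; return (tree, next_index)."""
    if i >= len(s) or s[i] != "(":
        raise ValueError("expected '('")
    i += 1
    items = []
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            node, i = parse_group(s, i)      # recursion = the stack
            items.append(node)
        elif s[i].isdigit():
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1
            items.append(int(s[i:j]))
            i = j
        elif s[i] == " ":
            i += 1
        else:
            raise ValueError(f"bad character {s[i]!r}")  # reject, don't guess
    if i >= len(s):
        raise ValueError("unbalanced '('")
    return items, i + 1

def recognize(s):
    tree, end = parse_group(s, 0)
    if end != len(s):
        raise ValueError("trailing garbage")             # reject that too
    return tree

def process(tree):
    # Processing only ever sees a fully validated parse tree.
    return sum(process(x) if isinstance(x, list) else x for x in tree)

assert process(recognize("(1(2 3)4)")) == 10
```

The point is the separation: `process` cannot be reached by a string that is not in the language, so there is no half-parsed state for crafted input to exploit.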
So you need to know and be strict about what your input language is. And this should be easier than people seem to think it is, because, you know, we have our RFCs. They have convenient grammars in nice shiny BNF notation in the appendix. We can and should generate our parsers automatically from those grammars rather than hand-coding them, because really, who's got the time for all of that? We're programmers. We're hackers. It's our job to be creatively lazy. So it's important to know how much computational power you need to recognize a given language, so that you don't do dumb shit like trying to parse a nested structure with a regex. But once you've done that, you can generate the recognizer and then hand the resulting parse tree off to the internals of your code, and let your code operate over that parse tree. But please, no more Turing-complete input languages. Don't let the Turing beast devour our safe computing future. So as I've said before, the regular languages are a safe place to be. Regex syntax is terrible, but people understand it for the most part. It could stand to be better, but people understand it better than they understand everything else. You know, part of the problem there is that we just don't have really good tools yet for handling higher classes of languages. People are still kind of starting to wrap their heads around parser combinators, and really nobody likes Bison. But the tools are out there, and we're working on making better ones. But where this really matters is in design. We must reduce computing-power greed and keep the Turing beast from eating our investments. So most of what I was talking about previously had to do with the single-system case. Now let's talk about systems that are communicating with each other. Is what you're seeing the same as what they're seeing? The story doesn't end if there's no shellcode chestburster. So across interfaces is where we need to minimize the complexity of parsers and recognizers. Like I said in the beginning, it's the communication boundaries that we really care about. It is necessary for parsers that are involved in a multi-system exchange to parse those messages in exactly the same way, because if they don't, then it becomes really easy for an attacker to generate something that one system regards in one light and another system regards in another light. Now you might be saying, oh, but this all sounds so theoretical. Nope, sorry. Two years ago, Len and Dan Kaminsky and I beat the hell out of X.509 in pretty much exactly this way. And we came up with about 20 different ways that you could generate a CSR, send it off to a CA that's running one particular X.509 implementation, get back a certificate, and show it to a browser that is using a different X.509 implementation, and that browser believes that the certificate you got signed is a valid certificate for a domain you don't own. You can take a look at our talk from Black Hat 2010, "Exploiting the Forest with Trees," for more details on this one, because it's full of lulz. But really, this is the halting problem again, just in a different form. I mentioned that Rice's theorem reduces to the halting problem; they are functionally equivalent. There is another problem that reduces to halting, called the context-free equivalence problem. So the context-free languages fall into two subclasses.
You have the unambiguous context-free grammars, which correspond to the deterministic pushdown automata, and the ambiguous context-free grammars, which correspond to the non-deterministic pushdown automata. So if your language is regular or deterministic pushdown, then determining whether two implementations are actually doing the exact same thing is decidable, and that's fine, and you can actually automate that. But if you're ambiguous context-free or stronger, this problem is undecidable. And we run into this in IDSes, too, which is kind of wacky. You know, if your problem is that you have a shotgun parser in your code, and you say, well, fine, we'll just throw a less vulnerable component in front of it, then all you've done is move the problem to another layer, because instead of your target code and a possible attacker speaking different dialects of the same protocol, now it's your IDS and your target code doing it. So you haven't really helped any; all you've done is move it around. And this is research that people have been doing for a while. Ptacek and Newsham in 1998, Vern Paxson in 1999, they've all observed this. They just weren't thinking about it in terms of the broader computability picture. But that's okay; that's what academics are here for. So once you've created a computational automaton of a particular strength, the genie is out of the bottle, and there is no going back from that; the dark energy will resurface elsewhere in your code. You haven't solved the halting problem. You've just moved it into a different instance of the halting problem. So when you are designing protocols, it is vital to choose the simplest possible input language, preferably regular, but certainly no stronger than deterministic context-free. And this sounds kind of intimidating, right? But Meredith, you said that length fields make a protocol context-sensitive, and doing that comparison is undecidable. How can we survive without length fields? And my answer to that is: we have these nice things called S-expressions. If you can bound a message with delimiters, that's still deterministic context-free. You can get away with this. It requires different approaches than the ones that we have gotten used to in the last 30 years of internet protocol design, but this is okay, because the other tools that I'm talking about have been around even longer. But at the end of the day, the crucial thing is: we must have computational equivalence at all protocol endpoints. So, looking back at the very, very early days of the IETF, there's a thing called Postel's Principle, which, simply stated, is: be conservative in what you send, be liberal in what you accept. Now, I don't mean to bust on Jon Postel unnecessarily. Postel's Principle was absolutely necessary at the beginning, during the early days of the internet, because we were figuring all this shit out as we went along, and we needed the flexibility that Postel's Principle afforded us. But the problem is that people kind of misread "be liberal in what you accept" as either "be lazy in what you accept" or "don't worry so much about being kind of a dick and changing the protocol and sending other stuff, because other people have to be liberal about what they're accepting, so they'll just accept our crap anyway," Microsoft. So we're proposing that Postel's Principle needs a patch, and it looks a little something like this. Instead of "be liberal in what you accept," we say: be definite about what you accept. Treat inputs as a language; accept them with a matching computational automaton; generate the recognizer from the grammar of the language; treat input-handling computational power as privilege, and reduce that privilege wherever it is possible to do so.
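As an illustration of "be definite about what you accept," here is a minimal Python sketch for a hypothetical toy message format; the format, names, and grammar are all invented. The grammar is written down once, the recognizer is derived directly from it (here, as an anchored regular expression), and anything that doesn't match is rejected outright.

```python
import re

# "Be definite about what you accept": the entire grammar of a hypothetical
# toy request ("GETv1 <name>\n") lives in one place, and the recognizer IS
# the grammar -- no liberal guessing, no fixups.

MESSAGE = re.compile(r"GETv1 (?P<name>[a-z]{1,32})\n")

def accept(raw: bytes) -> str:
    m = MESSAGE.fullmatch(raw.decode("ascii", errors="strict"))
    if m is None:
        raise ValueError("not a member of the input language")
    return m.group("name")

assert accept(b"GETv1 hello\n") == "hello"

# Lenient-looking variants are simply not in the language:
for bad in (b"getv1 hello\n", b"GETv1  hello\n", b"GETv1 hello"):
    try:
        accept(bad)
        raise AssertionError("should have been rejected")
    except ValueError:
        pass
```

Because the language here is regular, the recognizer's power is exactly matched to the grammar, which is the "computational power as privilege" point: the input handler is given no more machinery than the format requires.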
All right. So, the takeaway for today. Do not let your protocols grow up to be Turing-complete. People will try. People will tell you, oh, well, we need these special cases, we need to handle this weird corner case, we need to keep bolting on other stuff to make other things possible. No. Disbelieve the illusion: down that way lies madness, because if your application relies on a Turing-complete protocol, it is going to take infinite time to secure it. Don't mistake complexity for functionality. It's an easy mistake to make, but don't let somebody sell you on it. If somebody's trying to sell you on saving money on future upgrades thanks to extensibility, take a second look and make sure that you're not going to lose money on security remediation because the Turing beast has come and bitten you in the ass. If somebody tells you that, oh, well, you know, this system is totally extensible and updatable because it embeds a scripting language in data: run, run, run, run, run. In-band signaling: bad, very bad. And the practical value here is that you're saving money and you're saving effort. You can also expose vendors who are claiming that they have what amounts to a solution to the perpetual-motion problem. This helps you choose the right components in order to have security that you can manage. It also helps you avoid system aggregation and integration nightmare scenarios, because if you know that all of your systems speak composable protocols, and all of the systems that you're going to be composing them with speak composable protocols, it all fits together like Legos. So our approach helps people avoid misinvestment of money and effort, and expose vendors that claim security based on solving perpetual motion. You stuck this slide in there twice, Sergey. Sorry about that. All right, I'm going to skip this one. Wow, you put it in there three times. That was kind of great. All right, so again: stop weird machines; no more Turing-complete input languages; reduce computing-power greed; ambiguity is insecurity; full recognition before processing; computational equivalence for all protocol endpoints; and context-free or regular, bitches. There are your slogans. Have fun with them. Thank you.

Okay, thank you. We now have comfortably enough time for a large Q&A session. So if you have questions, please line up at the two microphones in the aisles. And I think we can start maybe here on the left side.

Hi, do you have any examples of languages that do this already? Sorry, not languages, but implementations that do this already, that use something regular to transfer between two components?

Well, okay, like I mentioned earlier, there are not a hell of a lot of existing protocols that are only regular, which is a thing that needs to change. I did mention earlier that ATM happens to be a regular message format, because all packets are of a fixed length, which is a handy thing. If all of your strings are exactly 53 bytes and only that, then you can actually specify that with a finite state machine. That said, I don't know of any implementations that use a regular expression to parse ATM, which is sad.

Okay, now a question from the mighty Internet.
Yeah, Krokodilirian from IRC asks if there are any tools that generate parsers in different languages than BNF?

Right, so the question was: are there any tools that generate parsers in different languages than BNF? And the answer is yes. So I mentioned parser combinators earlier. These are both a mathematical abstraction and a practical library in a butt-ton of different languages, and they can actually parse some of the context-sensitive languages, which is really neat. So you know how, when you're building a regular expression, you're essentially just banging smaller regular expressions together? That's essentially how parser combinators work, but they also admit recursion. So, yeah, Daan Leijen invented... I totally botched that pronunciation, I'm sorry. But yeah, he invented them back in... it was only a couple of years ago, actually. They initially showed up in Haskell, but since then they've crept into Scala, I think also Clojure, Python, JavaScript. So the tools are starting to creep into more and more commonly used languages. If you're interested in learning about parser combinators, I really recommend checking out the parser combinator standard library in Scala, because it looks a lot like Java, and there's a lot of really good material out there on using Scala to write embedded DSLs with parser combinators; the tutorials out there are way better than anything I could give you in five minutes here.
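For a flavor of the idea, here is a minimal parser-combinator sketch in Python; it is invented for illustration and far cruder than Parsec or Scala's combinator library. A parser is just a function from (text, position) to either (value, new position) or failure, bigger parsers are built by combining smaller ones, and recursion is what lifts this past regular expressions.

```python
# A minimal parser-combinator sketch. A parser is a function:
#   (text, pos) -> (value, new_pos), or None on failure.

def char(c):
    return lambda s, i: (c, i + 1) if i < len(s) and s[i] == c else None

def seq(*ps):                      # run parsers one after another
    def p(s, i):
        out = []
        for q in ps:
            r = q(s, i)
            if r is None:
                return None
            v, i = r
            out.append(v)
        return out, i
    return p

def alt(*ps):                      # first alternative that succeeds
    def p(s, i):
        for q in ps:
            r = q(s, i)
            if r is not None:
                return r
        return None
    return p

# Recursion is the step past regular expressions:
#   nested ::= "(" nested ")" | "x"
def nested(s, i):
    return alt(seq(char("("), nested, char(")")), char("x"))(s, i)

assert nested("((x))", 0) is not None
assert nested("((x)", 0) is None
# (A full recognizer would additionally insist the whole input is consumed.)
```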
Over there? Hi. So you talk about the halting problem and Turing-complete input languages. Yes. But I think pretty often we don't care about whether it's going to take finite time to handle this input; we care whether it's going to take a reasonable time. And even with regular expressions, there are certainly plenty of regular expression library implementations that have cases where they will take exponential time.

Sure. Yeah, classical cases, yeah.

So... how do you avoid that sort of problem?

Well, so my bigger concern... yeah, tractability is obviously a huge problem, but really my bigger concern about Turing-complete input languages is the programming-the-weird-machine problem. So there was this hilarious result on GitHub a couple of months ago where somebody proved that HTML5 and CSS3 together, but without JavaScript, are Turing-complete. He did it by implementing Rule 110 for cellular automata using only HTML5 and CSS. It is brilliant and demented. And the fucked-up thing about it is that you could use that construction to do arbitrary computation in what is supposed to be only a markup language and its associated display components. I mean, how fucked up is that?

Hi. I was wondering if you could give an example around these length fields, because I think that's not that easy if you're trying to build something that can transport arbitrary user strings. I've run into some interesting CSV files lately. CSV can probably be implemented with a regular grammar, but you run into interesting things when you have arbitrary input strings in your CSV that can contain, for example, a line break, or commas, or whatever.

Right, right. And you've really just hit on it right there, because a CSV message is comma-separated values. And if those values are also allowed to contain commas, then you've got an in-band signaling problem: how is the interpreter supposed to determine whether that comma is part of a value or a delimiter between values? And, of course, the off-the-cuff answer to that one is, well, escaping, but, yeah, I'll just refer you to the last 20 years of SQL injection with regard to that one. So I mentioned S-expressions earlier. Yeah, those come from Lisp, land of parentheses. And the idea there is that you just have some set of delimiters that are not part of the sub-language that describes values. So when I think of a length field, I think of a value that tells me how many bytes to expect. It doesn't really hurt you to say, okay, rather than "expect 42 bytes and then count the next 42 bytes," you instead have an opening delimiter, consume, consume, consume until you see the closing delimiter, and you're done. As long as those opening and closing delimiters are not part of the sub-language that describes what appears between them, you can get away with this. Does that answer your question?

I'm not sure if it does, because I think that's just my point. It's easier, in some cases, to just have a length-delimited field instead of building this parser with all the escaping just to be able to handle arbitrary user input. Simply because I've run across, shall we say, broken and badly defined CSVs in the past.

Sure, sure, absolutely. And again, you know, badly defined things. Looks like Sergey actually has some input. Come on up here.

So, you see, you're right. It's counterintuitive. Why would I bring in a parser if I can skip over so many bytes, right? Why should I implement the proper escaping discipline and the proper scanning and parsing if I can just skip some bytes? But all bytes are really created equal, and when you're seeing these bytes, you don't know which ones are data and which ones are metadata. Packet-in-packet is based on that. Packet-in-packet is based essentially on the really simple state machine within the digital radio chip that kind of sort of gives you so many bytes for your frame. It turns out that despite the simplicity, it doesn't work so well, because you mistake your data bytes for your metadata bytes, and the whole scheme, the whole encapsulation, goes out the window, right? So this is basically the science part of it. There is a thing that seems simple but is actually fraught with danger, and you need to do a more complex thing, but one that's actually principled, that can be proven to be principled, and can be proven to require less computational power in the end. Think of parsing an IP packet. Or worse, think of parsing an IPv6 packet. Right? Imagine your pointer pointing into that packet. Do you know what you're looking at? When you're writing the internal code, do you know which properties of the packet have been validated already? When you're operating off of that pointer and happily casting into your structs and doing something with the values, do you know which assumptions are actually trustworthy, given the previous sanity checking, and which are not? You don't. You start in this land of mixed data and metadata, and you get the mess that we're dealing with basically every day. The reason for this is protocols that use the quote-unquote simpler thing, such as length fields. You are starved for context when you're in your internal code, when you're supposed to have already checked all the sanity, all the various assumptions about whether this is good data that can be trusted and not checked again at every turn. Because you can't really check data at every turn. Where do you stop? Right?
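Here is a minimal Python sketch of the delimiter-framing idea, assuming a hypothetical format whose value alphabet simply cannot contain the delimiters, so no escaping discipline is ever needed; the field layout and names are invented.

```python
# Delimiter framing for a hypothetical format: values are drawn from an
# alphabet that cannot contain the delimiters, so there is no in-band
# signaling problem and nothing to escape.

OPEN, CLOSE = "(", ")"
VALUE_ALPHABET = set("abcdefghijklmnopqrstuvwxyz0123456789")  # no parens

def read_field(s, i):
    """Consume OPEN, then value characters until CLOSE; return (value, next_i)."""
    if i >= len(s) or s[i] != OPEN:
        raise ValueError("expected opening delimiter")
    j = i + 1
    while j < len(s) and s[j] in VALUE_ALPHABET:
        j += 1
    if j >= len(s) or s[j] != CLOSE:
        raise ValueError("bad byte or missing closing delimiter")
    return s[i + 1 : j], j + 1

# Contrast with length framing ("5:hello"), which forces you to trust a
# count that arrived in the same untrusted byte stream as the data it frames.
assert read_field("(hello)(world)", 0) == ("hello", 7)
```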
Also, just one quick addition to that. There is an exception to the context-free equivalence problem for, say, context-sensitive languages with length fields in them: if two implementations generate their input-handling routines from the exact same grammar, then those are guaranteed to behave the same. Modulo the compiler fucking you. Sorry. Yeah, remind me later.

All right. How much time do we have left? We have another 10 minutes, so we can take another question from IRC.

Yeah. First of all, I apologize; this was a parser fuckup. Apparently the question about BNF I asked before was not about things other than BNF, but about generating parsers in different languages, like C, from actual BNF representations.

Oh, I'm sorry, I misunderstood the question. Your standard C tool for that is Yacc, yet another compiler compiler, or the GNU version, which is Bison. And the input language that both of these tools use is literally just straight-up BNF.

Okay, so somebody mentioned that some regexps require exponential time to execute. This is only true if you either use a backtracking implementation or add features like backreferences, so plain regexps don't really have this problem. And the other thing about backtracking regular expressions, like PCRE regular expressions, is that they have a stack; they're actually context-free. So, yeah.

Yeah, thanks for pointing that out. But you see, it's not really the running time that should concern you. It's trying to solve an unsolvable problem when validating input. Yeah, I mean, the real problem that we want to solve is: let's stop exposing weird machines, because that way we cut the threat of malware off at the knees. You must make sure that when data reaches the processing logic, it has been checked and you're fully certain what it is that you are operating on. If you don't, you've got the explosion of state, you've got transitions that you probably did not anticipate, and you've got that unexpected computation that is otherwise known as being pwned. So, when we say that an exploit is a proof, it's a proof by construction that a computation is possible in the actual environment that you're exploiting. You can't argue with proofs. This is science. Yeah. When you try to argue with reality, reality wins.

Okay, another question from the internet. Someone asks what your opinion is on the trend to invent ad hoc proprietary protocols between JavaScript apps in browsers and their servers.

I'm sorry, can you repeat the question? What your opinion is on the trend to invent ad hoc proprietary protocols between JavaScript applications in browsers and their servers. Oh, so you're talking about, like, you know, RESTful APIs and stuff like that? Yeah, so I'm not a huge fan of proprietary protocols in the first instance. You know, I can't stop people from shooting themselves in the foot. I can advise them to put the gun down, but I can't stop them. Well, you should realize that before RESTful was invented, CGI was an even bigger mess. You know, whatever the hell was in that URL, how would you even start making sense of it? Which things there were objects, which ones were methods or actions or messages; at least RESTful gives you an opportunity to start structuring that. But, yeah, we're not looking forward to the Turing-complete web application future. I can't stop them.

Okay, so we're doing one more question. Okay, then the more important one. Several people asked about using length fields for performance, like in Travis's talk and the other one, and then delimiters for certainty. I missed everything after "performance," sorry. Okay. I'm half deaf, my bad. Okay: using, in protocols, length fields for performance, and then additionally delimiters for certainty.

So, I think performance is really a red herring. For that, I want to refer you to Matt Might and David Darais' paper from 2010 called "Yacc is Dead," where they blow up the myth about parsing being, like, you know, impractically heavy and so on and so forth. You know, I have not seen real-world runtime statistics comparing, say, context-free to context-sensitive with length fields, but we could gin that up, and I'd be willing to place a bet right now that context-free will actually win on performance. Also, consider that by including both the length and the delimiter, you've just created a semantic ambiguity, because, you see, what do you think the implementers will do? Some of them will go by the length field, others by the delimiter. And you'll have the wonderful world of IDS evasion all over again. By the way, if you were there for Dan Kaminsky's talk, the conclusion of it, the tricks to reveal that your ISP is doing computation you don't want them to be doing on your packets, exposing the fact that they're doing this computation: well, this is the old world of IDS evasion rising again, showing that if you don't have computational equivalence, it's damn hard to hide it, and it's easy enough to expose it. So don't rely on ambiguity for security, because ambiguity is insecurity.
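To make that ambiguity concrete, here is a minimal Python sketch of a hypothetical format that redundantly carries both a length field and a terminator; two perfectly reasonable implementations disagree about the same crafted bytes, and that disagreement is exactly what an attacker needs for evasion.

```python
# A hypothetical format that redundantly carries BOTH a length field and a
# ';' terminator. Two reasonable implementations disagree on crafted input.

msg = "4:abc;extra;"   # the length says 4 payload bytes; the delimiter says 3

def parse_by_length(m):
    n, rest = m.split(":", 1)
    return rest[: int(n)]          # trusts the length field  -> "abc;"

def parse_by_delimiter(m):
    _, rest = m.split(":", 1)
    return rest.split(";", 1)[0]   # trusts the delimiter     -> "abc"

assert parse_by_length(msg) != parse_by_delimiter(msg)
# If your IDS parses one way and your server the other, the attacker
# chooses which of the two gets to see "extra".
```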
Okay, thank you. Give a huge round of applause for this awesome talk, please. Before you all leave, I heard stories about your lab coat. So could you come here, please? Can we please close all doors and put the house lights down? Because we have something to show here.

All right, so my dad is a chemical engineer. He worked for Exxon his entire career, and before I was born, he spent some time in the research lab there. When I visited my parents for Christmas in Houston and my mom heard about the 0-day lab coats thing, she was like, oh, cool. And I dug my dad's old lab coat from the research lab out of storage, the one that I used to play dress-up in when I was a little girl. And I was like, okay, well, this is CCC; it needs to have art on it. So I decided to go ahead and bling it out a little bit with some UV decorations. And I guess they're working on the light problem there. So yeah, this is not the very best lab coat in the world; this is just a tribute. And I want to thank Dan and Travis again for putting this art in the Bitcoin blockchain, because it really does mean the world to me. Thank you. Okay, lights back on, please. Yeah. Okay.