Hi there. I'm going to talk about taint tracking. I'm Mark Shannon. I'm a lead engineer at a company called Semmle. We do code analysis. We have a service that's free for open source projects, called LGTM.com. I have stickers if anyone wants any. I'll show you this during the course of the talk.
The schedule is: first of all, what do we mean by taint? It's an odd term, with slightly Victorian sanitary implications. I'll come to that later. There's taint tracking and taint checking, which I'll compare. One is a static analysis approach and the other is a sort of dynamic, built-into-your-code approach. I'll explain what taint tracking is and I'll give you a demo of our stuff showing how it works. Then I'll explain briefly what taint checking is and how we could implement it in Python. I don't have any code to give you; I'm just trying to give you ideas that you might want to use later on. You'll need to add checks if you're doing anything where you're interfacing with the outside world. Once we've gone through this stuff, it will hopefully become apparent to you where you should add these checks, but we'll discuss that briefly, and then I'll summarise a list of things for you to remember.
What is taint? Taint basically just means anything you can't trust. I'm not sure where the word taint comes from. It's one of those things where someone once started using it, people kept using it, and it's now the conventional term. It's just untrusted data: basically anything from the internet. But if you're allowing clients to upload files, for example, it could be the contents of any of those files, so you might not even trust your own file system in certain cases. You could be tracking a user ID, and you could regard users who aren't admins as tainted because there are certain things they're not allowed to do. It could be a number of things.
There we go. Taint tracking. We've got this taint. Some of our program is tainted. What does that really mean?
It means that there are untrusted values in the variables in various parts of our program. We're interested in seeing how they could end up somewhere we don't want them. There are three components to this that we're interested in. There are the sources, which is where the taint comes from. I said that was probably the internet or something similar. There are the sinks, and that's where we don't want the taint to end up. It could be a SQL query. It could be an eval or an exec. It could be a path used to open a file, because we don't want people opening the password file. It could be, again, any number of things. The last thing is what's called a sanitiser. Again, Victorian morals, cleanliness-is-next-to-godliness terminology. I don't know where this terminology comes from. A sanitiser is just something that cleans up your data, taking it from untrusted to safe. For example, for SQL injection, a sanitiser would just be something that did SQL escaping. We'll come to that in detail in a minute.
The key thing here is that this is a code analysis, so we're not actually running any code. What we need to look for are potential paths in the code where this taint could get from an input, a source, to a vulnerable point in our code, a sink, without passing through a sanitiser. In other words, it's still in its unsafe form.
Sources. I think I've outlined some possibilities already. It's literally anything where, in some way, some malicious entity could have put in something that you don't want. It's anything from the internet, basically. But there might be other cases. If you're a big institution, you may trust some of your employees to do certain things but not trust other employees to do them. So input, even from your own employees, could potentially be regarded as tainted. Financial institutions in particular can be very sensitive about who can access what data, or medical institutions, or anything like that. I'll cover this later, but it's anything that you cannot trust.
Or a point in your code where something you can't trust enters that code. I think that's the key thing. So, for example, if you're using Django, you'll wire up your views, and you'll have a function that's wired to a URL, and that will take a request parameter. That parameter is a source of things you do not trust. There's nothing wrong with Django. It's not that there's a fault with Django. It's just that this is the entry point into your code where the outside world is sending you stuff.
Sinks. They're the places where bad things can happen. SQL injection. So who's heard of SQL injection? You've all seen the XKCD cartoon. There are other forms of injection. There's path injection, where, instead of a SQL query, it's just a path into the file system, and someone could put double dots in there and get out of the safe part of your file system into parts you don't want them accessing. Code injection, or remote code execution: that's usually the one where you get into the newspapers. There's a variety of these things. These are generally called injection attacks, and they are the headline versions of this, but there are other forms of attack or vulnerability that taint tracking can cover.
So, sanitizers. Well, what is a sanitizer and what isn't? It rather depends. There's no general rule that says this is a sanitizer and this isn't. If you HTML-escape a string and it somehow ends up in a SQL query, it's not safe, even though it's been HTML-sanitized. And vice versa. What's required to sanitize a SQL query is not what's required to prevent cross-site scripting, which is the form of injection where the input is a request and the output is a response.
Okay, so a simple example: code injection. I've chosen this one because this is the actual practical example I'm going to give you slightly later. And it's the simplest.
And sanitizers are, as I said, often inherently slightly more complex and fiddly to define. So this is a fairly simple one to define. The source is any HTTP request parameter, so Django, Flask, pretty much any of this stuff. If there's a parameter called request or req, there's a good chance that's something we want to be wary of. And exec or eval are our sinks. You probably think, oh, you should never use eval because it's unsafe and a security risk. But sometimes you have to use it in limited scopes. And sometimes you need to do it based on user input. So I've said there's no sanitizer in general, but there can be specific cases. For example, we may need to exec a certain small set of commands. The way we can sanitize that is to whitelist inputs and then map those to our commands. And that acts as a sanitizer. That mapping effectively blocks the user input. If the input doesn't match anything in our whitelist, we'll raise an exception or handle it in some other way.
And yeah, it's not just injection attacks. There are other security attacks. There are IDOR attacks, which stands for insecure direct object reference. That's kind of an in-memory equivalent of a SQL injection. It's where you have in-memory data that's indexed by, say, the user ID, and you haven't done a check, and it allows someone to paste a user ID that isn't their own into a URL and get information back about another user. So SQL injection is one way of doing that, but in-memory references are another.
And there are resource leaks. This doesn't have to be security stuff. It could just be losing file descriptors or losing sockets. It's not a security problem, but it's still pretty annoying when your program crashes because it runs out of file descriptors or runs out of sockets. So there are resource leak issues here. In this case, the sources are obviously where you create one of these things. And the sink is possibly something a little more subtle.
It's where, in general, it escapes your reference. If you create a file handle or a socket in a with statement, it's guaranteed to be cleaned up. But we can't always do that. Sometimes you have to create it and it gets passed around a bit. And it might get lost, effectively, or end up indirectly referenced by some other object and retained. So we can use taint tracking in those circumstances. And anything else that fits this pattern. I mean, use your imagination, because people out to get you will use theirs.
Okay, now this is the demonstration, which is interesting because I can't see my screen. So this is our IDE for our query language. Okay, I can sort of see over there. Okay, so I'm just going to click on one of these. That's almost certainly not big enough to see anything, is it? Okay, so this is some example code. Now, there's clearly nothing security-related here. These are kind of arbitrary sources and sinks. You can probably guess, but everything written in capitals is treated as a source. And the function sink: its arguments are treated as a sink. So our top one there, we can see, is a very simple flow. We assign a source to x and we then pass it to sink. We've got more complex examples of flow. For example, taint can flow through functions, out of functions, into functions, potentially through attributes. The potential number of flows is essentially unlimited. So what we're trying to model is as broad a set of those as we can without ending up in a situation where it looks like everything in the whole program is tainted and you're overwhelmed with false positives. Our analysis is reasonably good.
I'm not sure if I can use a mouse. So, take this one. Here's our source and here's our sink. And the flow here is through here. So you can see the flow. So we're okay. So we have a function called up and down. It calls has source, which has a source in it.
So the flow is from the source in has source, up into its caller, up and down, and then back down into its callee, which is our sink function.
I also have the works by the VF at times. Apologies for this button. No, I don't. Okay. What's going on here? Okay. It seems to have lost things. Any Eclipse experts in the house? It does occasionally do this. I can't. Yeah. The UI in Eclipse is probably a little bit too flexible for its own good. Restore is not restoring to what I wanted at all. Okay. Right. Well, I will show you the paths. No, I do not want that. This is deeply annoying. The window and then you're in the depth. Maybe at least the panels available. No, it should restore to what it was. Unfortunately, there's no back button. Yeah. I've just lost the views. That's all. Okay. Window, show view. With a bit of luck, we'll get the one we want. What happened there? Why is that? Okay. It thinks it's already showing it, but it's not actually showing it, which is unhelpful. Okay. Up here. Brilliant. Well done. Thanks very much.
Here's our little example. There are two paths here, apparently. We start here, then it comes up to here, the return value of has source. Let's see if I can do this without staring at the screen. Then that flows into the parameter for has sink. I missed. Click that, and on to our sink.
The last one. This is another file. I'll just show you something here. Here we have a slightly more subtle thing, where we have a sanitizer but we can bypass it. If the condition is true, we will bypass the sanitizer. What happens here is that x is assigned the source, and if the condition is true, it flows through the function that just returns its arg, and can then flow to the sink. We can see that path here. Very carefully, to do the right thing this time. Here's the flow we just got, from our source to there, into our function. Back out again. To our sink. These are obviously fairly simple contrived examples.
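The demo file itself is not reproduced in the talk, so the following is only a rough reconstruction of the kind of code being analysed, based on the description above. Every name in it (`SOURCE`, `sink`, `has_source`, `has_sink`, `up_and_down`, `identity`, `sanitize`, `bypassable`) is an assumption made for illustration. A static analysis would find these flows without running anything; the code is runnable here only so the flows can be checked by hand.

```python
# Hypothetical reconstruction of the demo code; all names are assumptions.
# Convention from the demo: names in capitals are sources, and the
# arguments of the function `sink` are sinks.

SOURCE = "untrusted value"


def sink(arg):
    """Stand-in sink. Returns its argument so a flow can be observed."""
    return arg


def sanitize(arg):
    """Stand-in sanitizer. Discards the input and returns a fixed safe value."""
    return "clean"


def simple_flow():
    x = SOURCE          # taint enters here
    return sink(x)      # and reaches the sink directly


def has_source():
    return SOURCE       # taint leaves this function via its return value


def has_sink(arg):
    return sink(arg)    # taint enters via the parameter and reaches the sink


def up_and_down():
    # Flow goes up out of has_source() into this caller,
    # then back down into the callee has_sink().
    return has_sink(has_source())


def identity(arg):
    return arg          # passes taint through unchanged


def bypassable(condition):
    x = SOURCE
    if condition:
        y = identity(x)  # sanitizer is bypassed on this branch
    else:
        y = sanitize(x)  # sanitized on this branch
    return sink(y)
```

Because one branch of `bypassable` skips the sanitizer, an analysis that looks for any possible path from source to sink should report it, even though the other branch is safe.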
I will also show you another fairly simple contrived example, which is actually not our code. So at least it should hopefully be a slightly more convincing demo, if nothing else. Switch from. We're already using this one. If I switch the results to this one. This is our code injection query. This is our actual production query. There's a lot of code here that's hidden. But basically, the query is simply: find a source, find a sink, and check whether taint flows from one to the other. The extra stuff is to do with the paths, which I just showed you. So this query understands the paths. The key part is "flows from source to sink". And that basically just does the flow analysis that I've been explaining, in that it follows the steps in the program. We need to understand Python semantics, obviously. And then there are more general language semantics, such as assignments to and from variables, tracking from an argument to a parameter, and so on.
There are two things here. One Python 2, one Python 3. I just randomly chose one. And we look at the path here. I just click on these. These are all very simple flows. Here's our source and it flows on. Now, I don't know if you can read this at the bottom. You might be able to, a little bit. There's one fine detail here, which is that what we're actually tracking changes along the way. We start from the request, which in itself is safe. Various elements of a request are provided by, for example, our server: the URL should be an entirely predictable thing, bits of it Django sticks in there, and so on. It's actually the query parameters that are the bit the user can influence. So we note the change. We start off with a request, and as we flow through it, it's still a request. And then we have what's basically a dictionary of query parameters, and then we extract something. And this is potentially our dangerous object, because this is a user-controlled value. Okay. Right.
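To make the request, query dictionary, extracted value chain concrete, and to show the whitelist sanitizer mentioned earlier, here is a minimal sketch. `FakeRequest` is a hypothetical stand-in with a dict-like `GET` attribute, loosely modelled on how a Django request exposes query parameters; it is not Django itself. The view functions and `ALLOWED_COMMANDS` are likewise made up for illustration.

```python
# Sketch only. FakeRequest is a hypothetical stand-in for a framework
# request object; the view functions and whitelist are illustrative.

class FakeRequest:
    def __init__(self, query):
        # The request object itself is not the dangerous part;
        # the query parameters are what the user controls.
        self.GET = dict(query)


# Whitelist mapping user-visible names to code that we wrote ourselves.
ALLOWED_COMMANDS = {
    "double": "value * 2",
    "square": "value * value",
}


def unsafe_view(request):
    expr = request.GET["expr"]   # request -> query dict -> user-controlled string
    return eval(expr)            # sink: tainted string reaches eval unsanitized


def safe_view(request, value):
    name = request.GET["op"]     # still tainted at this point
    try:
        code = ALLOWED_COMMANDS[name]  # the lookup acts as the sanitizer
    except KeyError:
        raise ValueError(f"unsupported operation: {name!r}") from None
    # Only strings from our own table ever reach eval.
    return eval(code, {"__builtins__": {}}, {"value": value})
```

In `safe_view` the user's string is only ever used as a dictionary key, so nothing the user typed is evaluated. In `unsafe_view` there is a direct path from source to sink, which is the pattern the code injection query looks for.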
I think I'm a little tight on time due to having fun with Eclipse. So let's leave that and go back to the presentation. Okay.
So, taint checking. Now, code analysis is really good. You don't need to change your code. It has a lot of benefits. But in security a belt-and-braces approach is often a good idea. So can we do this dynamically? Well, yeah. I'm not sure any web frameworks or any systems do this, but it's entirely feasible. So we have the same thing. We have sources, sinks and sanitizers. A source, again, is our web request. A sink is anywhere tainted data shouldn't get to. And our sanitizers are what the taint can't get past. So they're essentially our escape functions or whatever we do like that. Now, Perl and Ruby have this built into the language to some extent, but I'm not sure how robust or how general that is. Any Perl or Ruby experts, please come and tell me later.
Okay. So basically what we're looking at is just an object that doesn't really want to become a string unless you do it in the right way. So suppose our Django request arguments just returned this tainted data thing, which is kind of an opaque thing. Now, anything that doesn't want to show itself as a string is a bit of a pain for debugging. But it does at least give us some security value. In other words, the only way we can get this to a string is by explicitly calling its escape-HTML or escape-SQL method, which means we are guaranteed to have called a sanitizer on it. And we're also guaranteed to have called it only once, because we can't call that method on a string, since a string doesn't have it.
Okay. Right, I'll be fairly quick here. So basically the last thing is, having explained all this flow, I'm hoping that you will then think, well, where do we put these sanitizers? Where's the best place in our code? And I hope you'll have realized that you want to sanitize your inputs once and exactly once.
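The talk gives no code for this, so what follows is only one possible sketch of such an opaque object. The class name `TaintedData` and the method name `escape_html` are assumptions; `html.escape` is the standard library function. Python cannot truly hide an attribute, so this guards against accidental use rather than deliberate circumvention.

```python
import html


class TaintedData:
    """Opaque wrapper for untrusted text (illustrative sketch only)."""

    __slots__ = ("_raw",)

    def __init__(self, raw):
        self._raw = raw

    def __str__(self):
        # Refuse implicit conversion, so str(), format() and f-strings fail loudly.
        raise TypeError("tainted data must be escaped before use")

    def __repr__(self):
        # Keep debugging possible without revealing the contents.
        return "<TaintedData (contents hidden)>"

    def escape_html(self):
        # The designated way out: returns a plain str, which has no
        # escape_html method, so the escape cannot be applied twice.
        return html.escape(self._raw)
```

Other sinks would need their own designated methods in the same style. Concatenating a `TaintedData` with a string also raises `TypeError`, because the class defines none of the arithmetic methods.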
So you could do it exactly at the input or exactly at the output. Doing it in the middle makes it too unreliable. But the problem with doing it at the input is that you don't know what you're sanitizing for. I mean, you might have an input. Is it going to be a SQL query? Is it going to be reflected back to the user? So basically, always put your sanitizers just before the point of use. "Sanitize your outputs, not your inputs" is the phrase to remember.
So I think we're pretty much out of time. Things to remember: don't trust anything from the internet. I'm sure you all knew that already, but there's no harm in reminding you. Taint analysis consists of sources, sinks and sanitizers. It's quite a powerful technique. It's worth bearing in mind. Anything that passes from a source to a sink without a sanitizer is potentially bad. That's an avenue of attack. This is one technique amongst many. Don't rely on any one particular security technique. Use as many as you can. Both static and dynamic. You know, formal reviews. Anything else. Sandboxing and so on. And I think that's about it. If you want a job doing this stuff, come talk to me.
So we have time for one, maybe two questions. Anyone?
So I was just kind of curious. You said to use the sanitizers as late as possible. Is that a relative thing? So I'm thinking, say you have a Django web app, and you usually have the forms framework. Right before you start using the input, would you class that as being as late as possible? Because if you introduced it afterwards, you're actually dealing with potentially...
The answer to almost any question like this is: it depends. I'm not a Django expert, far from it. So, yeah, I mean, I guess that sounds... I mean, ideally, your latest possible is... I mean, you could say, well, it's under the ORM. It's the point at which you actually send a packet over the network. But that's kind of impractical.
So it's more a case of where the control flow narrows down so much that that's the only way into that thing. And that is sufficiently late, I would say.
All right. Thank you. Another question?
Okay. Are there any packages or practices where, when a sanitizer has returned its sanitized result, it labels the data or subclasses string somehow to make it obvious that this is now sanitized?
I mean, you could in theory. I think the basic thing here is that you're going to have to trust strings, because strings come from so many places in your program. There are also internal strings. In terms of doing this dynamically, I think it's just: make sure that the input is very clearly not to be trusted. Just make it unusable. I mean, literally, so it's guaranteed to raise an exception if you try to use it in almost any way, apart from the designated methods to turn it into something safe.
Okay. One more short question. Okay. I don't see anyone. So, yeah? Okay. So, let's thank Mark one more time.