All right. A couple of days ago, I was browsing one of my favorite Web 2.0 photo-sharing sites. It's a special site, specialized for Linux nerds and open source software people, a knockoff of a common Web 2.0 site. I was getting ready to upload my favorite xkcd comic, in case somebody had missed it, and all of a sudden the site was asking me to fill in how many shells are in a standard Ubuntu installation. And I'm tired of dealing with this stuff. I'm sure most of you have seen this kind of thing. These are CAPTCHAs. They're supposed to prevent automated robots from abusing websites, and they come in all sorts of shapes and sizes. Some are colored, some are squiggly, some have grids, some ask you to solve high-order calculus problems, some don't. But they all have two attributes in common: they're all kind of annoying, and none of them really work that well. So what's the inherent problem here? To make a good CAPTCHA, the problem has to be really hard for a computer. But anything that's hard for a computer tends to be pretty hard for a person too, and a CAPTCHA can't be too hard for a person. That tension creates a niche for people who want to break these things. Before we can talk about what's involved in breaking a CAPTCHA, we need to talk about when it's worth it to break one. Why do people bother? The issue is that there tend to be two types of targets protected by CAPTCHAs: low-value targets and high-value targets.
The distinction between these is key to understanding what kind of threat you need to consider. For high-value targets, your CAPTCHA is not really going to do a great job, because if the value of making a post, say in the personals section of Craigslist, is ten cents, and it costs one or two cents to hire a person to solve the CAPTCHA for you, then the CAPTCHA just doesn't matter. On the other hand, if you're a spammer trying to hit every blog on the web, then you're much less interested in those high-value targets and much more interested in automating the procedure so that you can hit a huge number of sites without paying someone a cent every single time you post. So this talk is focused on what's involved in breaking a CAPTCHA with automation. We consider there to be two approaches to doing this. In one case, you attack the back-end implementation of the CAPTCHA generator. In the other, you break the actual CAPTCHA problem itself, you solve the crux. We'll talk about implementation-style attacks first, and then Scott will talk about attacking the actual CAPTCHA problem a bit later. One of the really common problems in lousy CAPTCHA implementations is that they don't record enough state on the server side to tell whether a submitted solution is legitimate. There are two main classes of state that tend not to be recorded. In the first case, the implementation doesn't tie the CAPTCHA to a specific request.
This is particularly relevant because it means someone can batch-request a large number of CAPTCHAs, solve them very efficiently in batch mode, and then use those solutions over a long period of time. This is actually a fairly common problem, present even in some fairly widespread CAPTCHAs. The other problem, a little less common, is that some implementations simply can't tell whether a CAPTCHA has been used before, so it's possible to reuse the same solution several times, usually within a certain time window. Here's an example: a chunk of code from a defunct software project. The project is defunct, but you can find the code living on at various sites across the web if you look hard enough. All it does is compare the inputted solution string to a calculated code, and it calculates that code by taking the MD5 hash of a site-specific key plus the request token plus a state field. But the state field only changes once a day, so a spammer can simply solve this CAPTCHA once and then reuse the same solution all day. And this is not an uncommon problem. Another common problem in low-end CAPTCHAs is that the encoding is lousy; for example, the solution of the CAPTCHA is included in a form parameter in, say, ROT13. This is usually done in the name of mythical horizontal scalability: we don't want any server-side state, we don't want to have to remember what the solution is, so we just encode it. I was originally a little worried about finding a good example of this problem, but here's something from an actual webpage online.
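Before looking at that page, the replayable daily-state check from the defunct project is worth making concrete. This is a minimal Python sketch of that kind of validation, not the project's actual code; the key and field names are hypothetical:

```python
import hashlib
import datetime

SITE_KEY = "s3cret-site-key"  # hypothetical site-specific key

def expected_code(token: str) -> str:
    # The "state" field only changes once a day, so every request made
    # on the same day hashes to the same value for a given token.
    state = datetime.date.today().isoformat()
    data = SITE_KEY + token + state
    return hashlib.md5(data.encode()).hexdigest()

def check_solution(token: str, code: str) -> bool:
    # Nothing records that a code has already been used, so a spammer
    # can solve the CAPTCHA once and replay the same pair all day.
    return code == expected_code(token)
```

Because the server keeps no record of used codes, `check_solution` happily accepts the same token-and-code pair any number of times until the state rolls over at midnight.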
It gets a reasonable amount of traffic. You'll notice that the CAPTCHA asks you to solve a simple arithmetic problem: there's an X, a Y, and an operation. But there are three fields in the image URL: an X, a Y, and an L field. It turns out that if you add 12 to the X value in the URL and subtract 17 from the Y value in the URL, you end up with the actual CAPTCHA operation. And it gets a little better. The people who wrote this CAPTCHA were feeling a little lazy, and apparently they didn't want to do that arithmetic themselves on the server side, so they encoded the solution itself in the form. Of course, the solution was the MD5 of the answer, and when the answer ranges over integers from, say, 0 to 20, that's not a very large search space. Another fairly common problem is that a single image URL will always return the same letters, but, again because of not wanting to retain state on the server side, it will randomize some of the parameters of the image generation. So you end up with lots of different-looking versions of the same thing. This is a problem because if you're using an OCR-based attack, it lets you very easily pump up the accuracy of your classifier. So let's take a quick look at how some of these attacks look from a code perspective. This is a very simple, basic CAPTCHA implementation, and some very basic Python code to talk to it. At first look, it basically works: if you type in the word correctly, it lets you through; if you type it incorrectly, it doesn't. But this first implementation has a couple of problems.
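Going back to that hashed-solution-in-the-form pattern for a moment: when the answer space is that small, inverting the MD5 is a one-liner. A sketch, with the 0-to-20 range taken from the example above and the function name my own:

```python
import hashlib

def crack_hashed_answer(md5_hex: str, lo: int = 0, hi: int = 20):
    # The form carries MD5(answer), but the answer is a small integer,
    # so we just hash every candidate and compare.
    for n in range(lo, hi + 1):
        if hashlib.md5(str(n).encode()).hexdigest() == md5_hex:
            return n
    return None  # answer outside the assumed range
```

Twenty-one hash computations per CAPTCHA is effectively free for an attacker.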
In particular, this one has a static token; it never changes. So this is a basic example of how you post a request, the basic template code for submitting a CAPTCHA solution, and as you can see, it's trivial to break. Upgrading it a little, the site tries to randomize the CAPTCHA solution, but if you look at the URL, you see a poorly encoded version of the solution there. Any right-thinking hacker might take a look at those letters, all ASCII but not quite the letters you see in the image, and think: what would an engineer do? They'd use ROT13. So let's try that and see what happens. And it works. So now your favorite social bookmarking site has gotten wise to your attacks and improved things a bit, but they still allow you to request multiple versions of the same image. First, here's a classifier-based attack for the case where you can't request the image more than once: how well does the classifier really work at the baseline level? You can see it doesn't work very well at all. But taking advantage of the ability to request multiple copies of the same image and using, say, a simple voting algorithm, it takes substantially longer, but it works with substantially better accuracy. These have been some very basic, run-of-the-mill vulnerabilities that you see in a fair number of places. But let's take a moment to look at some less common vulnerabilities that are a little more esoteric, but interesting nonetheless.
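As an aside, the voting trick just mentioned is simple to sketch. Assuming each OCR pass returns a same-length string for the same underlying image, a per-position majority vote looks like this:

```python
from collections import Counter

def majority_vote(readings: list[str]) -> str:
    # readings: several OCR attempts at the same CAPTCHA, obtained by
    # re-requesting the image with different rendering randomness.
    # Take the most common character at each position.
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*readings)
    )
```

Individually weak reads combine into a strong one: even if each pass garbles a character or two, the errors rarely land in the same position twice.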
One of the key problems in making a CAPTCHA is generating the code you're going to use. Most people use some sort of random number generator. Now, we all know that making good random number generators is hard, and keeping them good is sometimes even harder. By and large, you'd think people would be able to do this, but the problem is that most people writing basic CAPTCHA packages for a web application aren't really familiar with it, so they usually fall back on the basic random number generator provided by their scripting language. In PHP, they might just call rand(). Now, admittedly, reconstructing the sequence from just a couple of CAPTCHA letters is a little tricky, because you're looking at truncated output: a large number indexed into a fairly small number of bits to select a character from whatever character set is being used. You also have to worry about intermediate requests to the random number generator, and about, say, multiple servers sitting behind a load balancer. But even so, you'd really like the security of your implementation to be independent of the actual web server configuration. So sleep easy: use MD5 combined with a key and the output of the random number generator. Just for kicks, let's take a closer look at a CAPTCHA implementation with this problem. Here's a bit of code from a fairly easy-to-find PHP CAPTCHA. As you can see, it uses rand() and simply indexes the output into a character set. So what does the PHP rand() function do? Perhaps it does something magic and secure, like you'd expect from PHP. Looking more carefully: no, not really.
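PHP's rand() typically bottoms out in a libc linear congruential generator. As a toy illustration of why that's fatal, here's a Python sketch using the classic ANSI C constants, not any specific PHP build, and assuming you observe full outputs; recovering the state from truncated outputs takes more work, as discussed below:

```python
# Classic ANSI C rand() constants; a toy stand-in for whatever libc
# generator PHP ends up calling.
A, C, M = 1103515245, 12345, 2 ** 31

def next_state(state: int) -> int:
    # One step of the linear congruential generator.
    return (A * state + C) % M

def predict(observed: int, n: int) -> list[int]:
    # For a pure LCG, the full output *is* the internal state, so a
    # single observed value lets us clone the generator completely
    # and predict every subsequent output.
    out, state = [], observed
    for _ in range(n):
        state = next_state(state)
        out.append(state)
    return out
```

One observed value and the whole future sequence, and hence every future CAPTCHA code, falls out.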
It tries to choose intelligently between different libc implementations, but that's not really doing much. Is rand() any good? Well, it's a linear congruential generator. There are a couple of different implementations, but I think everyone here knows it's bad. But how bad is it really? How many bits of output from rand() do you need, and how many samples, to reconstruct the sequence? It turns out: not that many. For example, if you only have the high-order 20 bits, you still only need three samples. And if you only have, say, five bits from every generated random number, you can still reconstruct the sequence perfectly with roughly 12 to 13 samples, which is well within the range you can get from most websites' CAPTCHAs. Now, you might think random() is a little better if you look at the man page. It says something fancy about a non-linear additive feedback generator; maybe that's doing something pretty cool. But this is one of those cases where the people writing the documentation don't actually know what's going on. If you look at the source, it turns out it's just a linear feedback system, which really means that any individual number is the sum of a couple of previous numbers, maybe plus one. If you're interested in looking further at this random number generator stuff, here are a couple of references; you can find them in the slides on the CD and on the website after the talk. For now, Scott is going to talk a little about breaking CAPTCHAs using OCR. So: we're recovering Perl programmers up here, which means we're pretty lazy.
We like to use other people's stuff wherever possible, even better if we can use it completely off the shelf with no work involved at all. Fortunately, there are some really good off-the-shelf OCR packages out there; we have Google to thank for some of that. One particularly good one that we use a lot is Tesseract. Essentially, our approach is to use these OCR engines as a black box. All we really care about is pre-processing the images to make them as suitable as possible for the OCR, and then post-processing the OCR's output to make sure we have a sensible solution; for example, a lot of CAPTCHAs are all lowercase characters, or only numbers, or something like that. So in the pre-processing stage, we make the image from the CAPTCHA as close as possible to the typical text that OCR engines like to analyze. A lot of that involves removing noise, smoothing, despeckling, removing a bad background, and so on. We can automate this as much as possible, usually with just a couple of filter stages. It's important to note that this isn't adding any data. It's not giving us any more information about the CAPTCHA; it's just making it more suitable for the OCR. Here are a couple of examples of simple CAPTCHAs with pre-processed and post-processed images. The top example, for instance, is just a matter of a thresholding algorithm: all we have to do is drop out the background, and since the foreground is all black, it works perfectly. In the second example, these people think they've been a little clever by adding some lines to the background. There's not a lot of contrast between the characters and the lines, so that could be difficult. But the lines are really thin, so it's easy to just blur them away, and then run the thresholding algorithm on the result.
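A pure-Python sketch of that blur-then-threshold pipeline, operating on a grayscale image represented as a list of rows, with 0 meaning black ink and 255 white. Real code would use an imaging library, but the logic is the same:

```python
def box_blur(img):
    # 3x3 mean filter: thin interference lines get washed out toward
    # the background, while thick character strokes stay dark.
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) // len(vals)
    return out

def threshold(img, cutoff=128):
    # Keep anything still darker than the cutoff as foreground (1).
    return [[1 if p < cutoff else 0 for p in row] for row in img]
```

After the blur, a one-pixel-wide line averages toward white and falls above the cutoff, while a three-pixel-wide stroke keeps a dark core and survives.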
These are pretty easy examples; you wouldn't really see them on a high-profile site, at least not anymore. But we can handle some higher-level examples too. The top image has also taken the low-contrast approach: they're trying to reduce the contrast between the characters and the background, use some funky colors, run gradients everywhere, make it a little more complicated. But once again, this is just a matter of an edge-finding algorithm, and then we run the same simple stages. Finally, the last example is starting to get a little harder. We have a complicated background, there's hardly any contrast between the foreground letters and the background, and it's not going to be a simple matter of just blurring it. But we can do some complicated tricks here to separate the textures if we use a convolution matrix. The convolution matrix is optimized for removing that specific texture at that specific scale of image. But most CAPTCHAs use similar textures between the background and the foreground across all of their images, so we can pre-optimize the convolution matrix for that CAPTCHA. If you really want to make a secure CAPTCHA, use these tricks in all sorts of bizarre combinations and randomize them as much as possible; that means the attacker has to write a classifier just to identify which randomization algorithm is being used before anything else. If we want to improve the quality of the OCR in general, we can retrain the OCR on the specific character set of that one CAPTCHA. Sometimes this is really simple, just a matter of generating massive amounts of data with, say, LaTeX. Sometimes you have to use actual CAPTCHA data. Here's a simple example with the CAPTCHA we saw in the second pre-processing example. Taking the baseline accuracy of Tesseract with the English training set, we can read the whole CAPTCHA word about 28.5% of the time.
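Going back to the textured-background case for a moment: the texture-removal trick boils down to a 2D convolution with a kernel that has been pre-tuned offline against samples of that CAPTCHA's background. A generic convolution sketch, with edge pixels clamped, where the kernel weights are the part you would optimize per target:

```python
def convolve(img, kernel):
    # img: grayscale rows; kernel: small odd-sized matrix whose
    # weights would be optimized against the target texture.
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    oy, ox = kh // 2, kw // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    # Clamp coordinates at the image border.
                    py = min(max(y + ky - oy, 0), h - 1)
                    px = min(max(x + kx - ox, 0), w - 1)
                    acc += img[py][px] * kernel[ky][kx]
            out[y][x] = acc
    return out
```

The same function handles blurs, edge finders, and texture suppressors; only the kernel changes, which is why pre-optimizing it per CAPTCHA is cheap.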
And that 28.5% is with 2,000 samples. If we then restrict the character set, so that Tesseract has been retrained for just uppercase letters, we can bump that up to 44.2%, which is pretty significant; it's huge for a spammer. And just to further illustrate the significance of a site presenting multiple variants of the same CAPTCHA: if we vote between multiple OCR results for the same word, we can bump that up to 96%, which is just unrealistic accuracy to try to defend against. Now Mike is going to talk about some other strategies for cracking audio CAPTCHAs. So, shortly after the first CAPTCHA went out on the web, web developers realized that not everyone on the web can see, and everything on the web that can't see isn't necessarily a computer. Suddenly there was a call for accessible CAPTCHAs, and the response was primarily audio CAPTCHAs. Audio CAPTCHAs are reasonably common; you can find them on most high-profile sites. And they offer some advantages to the attacker; I think there are two primary ones. On the one hand, frequency-domain filtering of sound is much more intuitive for most people. Most people can't think about frequency filtering for images: a low-pass filter that cuts out the high frequencies ends up blurring the image, and high-pass filtering translates to an edge-detection algorithm, but it's not really intuitive. Everyone, though, thinks about frequency-based filtering when they think about audio. On the other hand, I think that in a lot of senses there's less room for noise in audio CAPTCHAs. This is primarily due to two things. The first is that generating good audio, speech that sounds good, is actually a quite difficult problem.
The other thing is that when people listen to somebody else and understand what they're saying, a lot of the information that goes into classifying the speech comes from nonverbal cues, prior context, and other information that is not precisely encoded in the audio. This means that when you're designing an audio CAPTCHA that will be understood by a very wide spectrum of people, you have very little room to add noise that interferes with segmentation, or noise that people would have trouble understanding. There's very little room for multiple accents and things like that in a usable audio CAPTCHA. When you combine all of this, it turns out that in many cases it's not that difficult to get decent accuracy on these things. When you look at an audio sample, you generally analyze it in one of two domains. You're either looking at the time domain, in which you have a bunch of samples moving up and down over time, or you're looking at the frequency domain, where you're looking at the distribution of energy across particular frequency bands: how bass-heavy it is, how treble-heavy it is, and all the frequency bands in the middle. So first, let's look at the time-domain readout of one particular audio CAPTCHA. Looking at this time-domain representation, you see a whole bunch of spikes. Now, I wonder where the numbers are in this CAPTCHA.
It turns out that the green arrows are all real numbers, and the red arrows are not-quite-real numbers: noise that has been added to try to disrupt an automated classifier. But still, just looking at the peaks and using a simple peak-finding algorithm gets you quite a long way. We have a short demo of running a simple peak-finding algorithm on an actual audio CAPTCHA that I'd like to play. As you can hear, this CAPTCHA encodes 9 9 6 6 3, or something to that effect. The peak-finding algorithm finds all of the numbers, but it also finds a bunch of noise that was specifically designed to interfere. The problem of actually solving that is orthogonal, though: once you've extracted all of the individual number segments, you're left with the problem of classifying between the noise and the different numbers. So in order to classify these segments, you need to extract a small set of relevant features, say a vector with somewhere between 10 and 100 or 200 elements, that you can use to write your classifier, and on top of that you need some sort of classifier algorithm that uses that feature data to do the actual matching.
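Backing up to the segmentation step for a second, here's a toy version of that peak-and-segment extraction, over normalized samples in the range -1 to 1. The threshold and gap values are made-up tuning parameters, not the demo's actual settings:

```python
def find_segments(samples, threshold=0.2, min_gap=5):
    # Collect runs where |amplitude| exceeds the threshold; runs
    # separated by more than min_gap quiet samples become separate
    # candidate digits (some of which will turn out to be decoy noise).
    segments, start, end = [], None, None
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            end = i
        elif start is not None and i - end > min_gap:
            segments.append((start, end))
            start = None
    if start is not None:
        segments.append((start, end))
    return segments
```

Each returned (start, end) pair is then handed to the classifier, which decides whether it's a digit or one of the decoy bursts.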
Now, it turns out that in speech, almost all of the actual information is encoded in the frequency domain. This should be intuitively clear, because people can talk at different speeds and at different loudness levels and it more or less still works: people can still understand what you're saying. It also turns out that it's possible to calculate the frequency content of a sample of audio in a computationally efficient way using something called the fast Fourier transform. The great thing about doing this is that once you've converted your audio to the frequency domain, it becomes really straightforward to do frequency-level filtering, to run low-pass filters, to do things like that; and as I said, all of that is more intuitive for sound than for images. One of the issues, though, is that if you just run the Fourier transform and find the frequency content of every single frame in a short audio clip, you end up with a huge amount of data. I actually looked at a couple of different feature sets, and it turns out that one of the most effective is something called the power spectral density. All this is doing is looking at the relative power of frequencies, the amount of energy contained in different frequency bands over the entire sample. I'm not going to go into all the details of the math here, but that more or less worked well enough for me. So here's a quick demo of using the power spectral density plus a simple classifier to identify the previously played CAPTCHA. What this is doing is running a classifier against all of the little extracted samples, using that to figure out which ones are noise and which ones aren't, and at the end it prints out the correct answer.
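To make that pipeline concrete, here's a toy version of both halves: a power-spectrum feature (a naive DFT for clarity; a real implementation would use an FFT) and a distance-to-class-average classifier. Everything here is illustrative and simplified, not the actual demo code, which combines this distance score with an SVM:

```python
import cmath
import math

def power_spectrum(samples):
    # Naive DFT; returns normalized energy per frequency bin, which is
    # largely insensitive to overall loudness once normalized.
    n = len(samples)
    bins = []
    for k in range(n // 2):
        x = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
        bins.append(abs(x) ** 2 / n)
    total = sum(bins) or 1.0
    return [b / total for b in bins]

def classify(feature, training):
    # training: {label: [feature vectors]}. Score each label by the
    # distance from the sample to the average of that label's training
    # vectors, and pick the closest.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = None
    for label, vectors in training.items():
        mean = [sum(col) / len(col) for col in zip(*vectors)]
        d = dist(feature, mean)
        if best is None or d < best[0]:
            best = (d, label)
    return best[1]
```

Feeding each extracted segment's spectrum through `classify` with labels for each digit plus a "noise" class is the basic shape of the attack.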
That answer is 99663. Now, I should comment a little on which classifier algorithms I'm using. It's a combination of a support vector machine, an SVM, which, without going into the math, attempts to look at different classes and find in some sense the optimal separator, the optimal division between two classes, and another algorithm that simply calculates the error between the measured sample and the average of the training set. It combines those into a cumulative score, and that works. Okay, so I think Scott's going to talk a little about another strategy: large-scale training for image CAPTCHAs. Presumably some of you were a little upset when we glossed over actually training an OCR for a specific CAPTCHA. Unfortunately, that approach is really, really time-intensive compared to every other approach we've talked about so far, but sometimes you just have to take it. This is a sheet of a whole bunch of CAPTCHAs, several thousand samples, annotated with what the OCR algorithm, using the base English training set, thinks each letter is and where it thinks each letter is. As you can see, some of these letters are pretty poorly identified; a lot of them are split in two, and most letters it can't even guess at all. If we really want to make this accurate, we have to tell the OCR engine, when we're training it, exactly what each of these letters is, along with the complete bounding box around each letter, so that it can segment appropriately and extract the right features. Most OCR engines don't use a typical SVM or nearest-neighbor classifier like we're using for the audio CAPTCHAs; they have a somewhat different feature-extraction approach. So here's what helps a lot in generating these training sets. The OCR engine is going to expect, effectively, a data file that summarizes each
letter in the training set: where it is in the image and what it's supposed to be. So if you have a human editing these massive training sets, it helps to have a tool to convert back and forth between something you can easily edit and what the OCR engine actually expects. Among the other examples on our website, we'll have a couple of scripts to convert back and forth between an SVG file, which you can easily edit with Illustrator, and the Tesseract box training file format. So really, what we're trying to get at here is that you can take a reasonable shot at almost every CAPTCHA on the internet right now with these techniques, and that includes the best of the best. Even top-five Alexa sites still have CAPTCHAs that can fall victim to these techniques at some level. For a spammer, it's really enough to get two to five percent accuracy: if you have a huge botnet at your control, you can throw so many resources at the problem that it doesn't matter if your accuracy rate is horrible. So the gist of this is that CAPTCHAs aren't really good at protecting resources. There are a couple of things you can do to try to make CAPTCHAs a little more significant, and the most effective of these is to integrate cultural knowledge. For example, a while back there was a Hot or Not CAPTCHA that asked you to identify hot or not people in an image. That might be a little more effective, but then you're limiting your user base: if you ask some Hawaiians what they think is attractive, it might be different from a set of Australians. So we propose an alternate CAPTCHA, the bourbon-or-scotch CAPTCHA. Well, I thought we had some funnier jokes in there too, but to that effect, now we're going to open it up for questions. Any particularly good questions will be rewarded, if you're of age of course, with a shot of bourbon. No, we don't have bourbon, sorry; we have scotch. Go ahead, anybody.
Nobody? Yeah, that is a very valid concern. The question, basically, was that if you have something using a set of cultural images, it has to draw those images from a database, and if that database is of limited size, the attacker could just exhaust all the possibilities and develop some sort of lookup table to make it easier to pass the classifier. So you have to be careful, when developing cultural CAPTCHAs, to make that more difficult. For example, in a hypothetical bourbon-or-scotch classifier, we might add randomness to the image, change the background behind the liquor bottle, add some distortion, change the shading, something like that. There are a lot of approaches you can take to reduce the attack surface of a database-backed cultural CAPTCHA like that.

Yeah, I think they're already obsolete. It's so absurdly cheap to hire people to break CAPTCHAs that even if you force the attacker to use a human, the vast majority of resources out there are worth more than the cost of hiring somebody to solve the CAPTCHA. I'm not going to speculate on anything grander than that.

Personally, I think the Turing test is a terrible test, mostly because people are pretty stupid and it's really, really easy to convince a person that they're talking to another person. That's just my opinion. Anyway, other questions?

I think CAPTCHAs are best used in conjunction with other techniques. You really need to be concerned with rate limiting, user moderation, and other things that reduce the effectiveness of spamming or breaking into your resource, but you shouldn't expect to rely on a CAPTCHA alone. All right, well, I guess we're... sorry, it's blended, Johnnie Walker Red. Oh, and you guys with the cultural CAPTCHA questions, that was pretty good, so if you want to come up, feel free. So I think the question was asking what we thought about
reCAPTCHA. reCAPTCHA, by and large, is actually quite good, I think. I haven't looked at their audio stuff carefully, but if I remember right, it has a little more noise than the stuff we looked at, and their image stuff is really quite irritating to segment; even though it doesn't have difficult backgrounds, they do a very good job of making the core OCR problem difficult. Their back-end implementation is also quite innovative, in that they use public and private keys to authenticate that a given CAPTCHA solution is coming from the same site it was submitted to. But it does have a couple of issues. In particular, it's essentially client-side, in the sense that the client requests a CAPTCHA from the reCAPTCHA servers rather than from the website's own servers, and one of the side effects is that they don't really tie the CAPTCHA to any particular request. So it turns out that it's possible, in one of those simple implementation-based attacks, to request a large number of CAPTCHAs for a single site, solve them using a very efficient Mechanical Turk-style interface, and then use those solutions over the course of the next, say, seven hours. So it's a good CAPTCHA, but it has a few issues.

Right, so I think the question is asking about audio CAPTCHAs. The segmentation problem turns out to be very simple in current implementations, but there are a lot of things you could do to make it much harder. For example, you might play a trombone or music behind a person speaking the numbers, and that makes the analysis much more difficult. I think one of the real problems here is that I'm not really convinced people are good enough at understanding these things that you have a lot of room to do that before the human error rate gets too large and the usability just
gets too low. I might be wrong; I haven't done any careful studies about that, but that's my hunch about why you don't see that more often in current implementations, and why it might not end up being a problem. Certainly, if people really started trying to generate conversational speech and things like that, it would become very difficult. It's really unfortunate that we have to worry about all these pesky users; it'd be a lot easier to develop strong CAPTCHAs if we didn't.

Yeah, so there are a lot of great text-based CAPTCHAs out there that, for example, ask you to solve math problems or answer simple cultural questions, but these suffer from the same kind of implementation pitfalls as the other database-backed cultural questions. If the same math problem is presented in the same format every single time, it's a lot easier for a computer to solve it than a human. If you use cultural questions like this, you have to have a huge set of questions to draw from in order to make it difficult to break.

That would probably be the most ideal, but if the spammer can access the same live stream, then you're limited there as well. Good question.

That's a great point, and I think that sort of thing can be somewhat more secure, but realistically, spammers have access to a lot more computational resources than your typical web user, especially depending on your type of client. If you're trying to appeal to grandmas with five-year-old computers, it'll be a lot easier for a spammer with a huge farm of EC2 instances to break your CAPTCHA than for the grandma to solve it. Yeah, I'm personally quite skeptical about using that sort of computational problem to rate-limit things like this, simply because you end up with an equation that says the more CPU power I have, the more evil I'm allowed to do. So you create a very strong incentive for someone to, say, implement
your core problem on a bank of FPGAs, at which point they're doing orders of magnitude more evil than the average person and still destroying your site. So I don't think that's a very good solution in the long run. Thank you very much.