Welcome back, everyone. Today we're going to talk about fuzzy hashing, and specifically the tool ssdeep. So you might be thinking, what is fuzzy hashing? Well, let's first take a look at traditional, or regular, hashing using md5sum. md5sum creates MD5 hashes. On OS X the command is just md5, and on Windows there are lots of different tools that can create MD5 hashes.

Right now I have several text files inside this folder. If I run md5sum against all of the text files, I get a hexadecimal string for each one. Each of these is called a hash value, so these are MD5 hashes. What md5sum is doing is reading in the data from each of those text files at the binary level and feeding that into the MD5 hashing algorithm.

The idea of a hash is that it takes all of the data that comes in and produces a fixed-length output that is, for practical purposes, unique to that data. So for any data that's exactly the same as whatever is in this 10 file, this email, if we had another email that was exactly the same, we would get exactly the same hash value. Now, don't think about the text, think about the underlying data that's in there. Comparing at the bit level, if we have two files that are exactly the same at the binary level, we get the same hash output. In this case, if we do a quick check of all of these hashes, we see that none of them are the same, so none of the data is exactly the same.

These types of hashes are very good for lots of different things in digital forensics. First off, they're very fast. If we know the hash value of a known good or a known bad file, we can use that hash value to very quickly search through a disk or a bunch of files and see if any files match the known hash value.
If they do, then we can either filter them out or focus on them, depending on what type of file it is and what we want to do with it.

Now, the problem here is that some of these files are very, very similar to each other, right? But the hash value doesn't care about similarity, it cares about exactness. Sometimes it's interesting to find files that are similar. For example, if someone was modifying documents, or if they stole intellectual property and changed it a little bit, and we want to find the original or even the modified version, then in some cases we want to know about similarities, not exact matches.

Let me remove this. So I have four different text files here, and I'm going to focus on email 26. I'll just go in here and show you what it is. In email 26 we can see the email header, and then the body of the email. These emails are actually from the Enron test data set. If we look at email 17, it's close to the same layout: we have the email header and then the body, but we can see that the text is very different. This one is several paragraphs, this one is basically just one paragraph, with different people in the email, things like that. So between 17 and 26 there's not really much similarity.

But imagine that I took email 26 and modified it slightly. It looks like pretty much the same email, although jumping back and forth we can kind of see what probably changed. I made a slight modification to it. So let's just pretend that I want to change the contents of this email but still make it look like the original email. We have the original and the modified version, and we have two other emails that are basically not related to this.
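The exact-match behaviour described so far can be sketched in a few lines of Python. This is only an illustration, not what was run in the demo: it uses the standard hashlib module instead of md5sum, and the byte strings are made-up stand-ins for file contents (a real check would read each file in binary mode). The helper name md5_hex and the known_hashes set are assumptions made for the example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Made-up stand-ins for file contents; real use would read files with open(path, "rb").
original = b"Inappropriate materials such as this are opened.\n"
exact_copy = bytes(original)
modified = original.replace(b"materials", b"cookies")

# A set of hash values for files we already know about (known good or known bad).
known_hashes = {md5_hex(original)}

for name, data in [("exact copy", exact_copy), ("modified", modified)]:
    digest = md5_hex(data)
    status = "matches a known hash" if digest in known_hashes else "no match"
    print(f"{name}: {digest} ({status})")
```

The exact copy is found through the known-hash set, while the one-word modification produces a different digest and is missed. That is the gap fuzzy hashing is meant to cover.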
Now, out of all of these, I want to know which ones are similar and how similar they are. Well, we can use something called fuzzy hashing, and the tool that we're talking about is ssdeep. I'll put a link to ssdeep below.

So ssdeep does what we call fuzzy hashing. If we do ssdeep *, it should hash all of the files in this directory, and we can see that the hashes are quite a bit different. Let me do md5sum again. md5sum gives me a hexadecimal value, relatively short. For ssdeep, we get this type of hash value instead. So this is the hash for ssdeep. What we can do with this is, instead of checking whether this number and this number are exactly the same, compare it with other fuzzy hashes to see how similar they actually are. We're using the hash values to see how similar two different files are.

So what I can do here is run ssdeep, hash everything, and save it to, let's call it, fuzzy.ssd. This is where I'm going to save all of the hashes. So I'm running ssdeep, getting all the hash values, and saving the output to fuzzy.ssd. I was going to suppress any error messages, but I'll just run it. Then it says "did not process files large enough to produce meaningful results". You have to have quite a bit of data to do reliable fuzzy hashing. In this case, with the emails, it should work, but if your files are too small you might have some problems.

If we look inside the hash file that I just created, we see the ssdeep header and the fuzzy hashes for the four different text files that I have in the same directory. Now I want to find out if any of these files actually match, or are somehow related. So I can do ssdeep -m and then feed it the fuzzy hash database I've just created.
I want to compare against all of the files in the directory, and I'm going to suppress any error messages here. Okay, so enter.

Now we have a couple of interesting matches here. You notice that demo text 10 matches demo text 10 at 100%. Well, we would expect that. But 10 is not matching anything else. It's not similar to the others in any way. 17, same thing: 17 doesn't match anything else. Then we have 26. 26 matches 26 at 100%, and we expect that. 26 matches 26 mod at 99%. So something in that file changed, but not very much, and the match is very high. Likewise 26 mod matches 26 at 99%, which is just the reverse, and 26 mod matches 26 mod at 100%.

If I want to see all of the comparisons that are done, I can add -a and it will show me the 0% matches as well. So in this case 17 matches 10 at 0%. Well, we expect that. Let's look at what could be similar between 10 and 17. 17 has a bunch of text and has this header, and the header is addressed to a specific person. Even the header is quite different. There are some things that are kind of similar in the header, but then again, the To field is very different. Here the To is just one person, and we have a CC and a subject. There we do have a subject too, but it's also different. So the similarity between these would be very, very low, and in this case it's counted as zero just because there's so much text that's different. These two files are considered not similar at all, because even with the Message-ID, where the tag is basically the same, the ID value is going to be different as well. So these two are considered completely different files, and they match at zero.

Where was it? Yeah, here. So 10 matches 10 at 100%, 10 matches 17 at 0, 10 matches 26 at 0. And that's what we expect.
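To get a feel for what a 0 to 100 similarity score means, here is a toy Python sketch. It is emphatically not ssdeep's algorithm: it uses difflib.SequenceMatcher from the standard library as a stand-in scorer, and the three sample messages are invented for illustration rather than taken from the Enron data. The function name similarity_percent is also an assumption.

```python
from difflib import SequenceMatcher


def similarity_percent(a: bytes, b: bytes) -> int:
    """Score two byte strings from 0 to 100 with difflib (a stand-in, NOT ssdeep)."""
    return round(SequenceMatcher(None, a, b, autojunk=False).ratio() * 100)


# Invented sample messages, used only to show the shape of the comparison.
original = (
    b"Please be aware that inappropriate materials such as this are opened "
    b"by the filter team. Forward anything suspicious to the help desk.\n"
)
modified = original.replace(b"materials", b"cookies", 1)
unrelated = b"Quarterly budget numbers are attached; the meeting moves to Thursday at noon.\n"

for label, other in [("itself", original), ("modified", modified), ("unrelated", unrelated)]:
    print(f"original vs {label}: {similarity_percent(original, other)}")
```

Identical input scores 100, the lightly edited message scores close to it, and the unrelated message scores much lower. A real fuzzy hash differs in that it compares compact signatures of the files, so the files themselves do not need to be kept around for the comparison.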
10 shouldn't match anything else, because it's an email that's quite different from any other email. Let's see 17: basically the same thing. 26: basically the same thing, except for 26 mod, where it matches a lot.

Now, fuzzy hashing works really, really well against text files, or non-compressed file types in general, but essentially text files, if you're trying to look for similarities. Let's go back. If I was looking for some of this text inside the email, I could just do searches through the email. But if I was looking for entire paragraphs, or something like that, fuzzy hashing could help figure out which files might contain that paragraph, depending on how you are searching. What this basically tells us is that if there is a modification between two files, we can now identify which files are probably modified versions.

Now just to give you another example. It works really well for text. It doesn't work so well for images. Sometimes it works, sometimes it doesn't. In the images folder I've set up a couple of examples. I have this cat, and this is the original photo. It's a JPEG image called Kitty JPEG 1. Then I also have Kitty JPEG 2, which has been modified slightly, so a bit of a swirly face here. Kitty JPEG 2 was edited using an image editing program and saved as Kitty JPEG 2. I also have Kitty PNG 1, which is basically an exact copy of Kitty JPEG 1, just saved as a PNG with compression. Kitty PNG 2 is the same thing except with the swirl, saved with compression. For PNG 3 I added a small black box to the image and saved it as a PNG with compression. For Kitty PNG 4 I just added an extra eyeball to the image, also saved as a PNG with compression. Kitty PNG 5 has basically no edits and is saved without compression, and Kitty PNG 6 is the swirl again, saved without compression.
So we have a couple of different images here, and we're trying to see whether we can detect different changes in the images. Now remember, ssdeep is not doing computer vision. We are not extracting features from the image and analyzing those features, or the similarity between features in the picture. We are analyzing the images at a binary level.

So what are some hypotheses we can already make about the images at the binary level? Well, I'm guessing that these JPEGs are not going to match. Why? Because of the compression in the JPEG format, those two images will be completely different at the binary level. PNGs I think will be a little bit similar, and PNGs with no compression I think will be very similar. So it really has to do with the file formats. With JPEGs I don't think we'll be able to get a similarity match at the binary level. With PNGs I think we will. That's my hypothesis here. And again, we're working at the binary level. We're not doing computer vision here, that's a different thing.

So we can do ssdeep *, and I'm going to suppress the error text with -s. Here are all of the hashes, one for each of the files. What we need to do now is create the hashes for all of the files, suppress the errors, and save everything to fuzzy.ssd. So we take all the hashes and save them to fuzzy.ssd, and then if we cat fuzzy.ssd we can see all of the hash values.

Now we want to compare against all of the files that are currently in here and see if any of these images are actually similar. So we can do ssdeep and feed it the file with -m fuzzy.ssd. Then I want to compare with all of the files in the directory, and I want to suppress all of the errors. Okay, so hit enter.
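The hypothesis about compression can be explored with a small Python sketch. Several things here are assumptions: the "image" is just made-up rows of text, zlib's DEFLATE stands in for image compression (PNG's compression is DEFLATE-based, JPEG uses a different, lossy scheme), and the position-by-position byte comparison is a crude measure that has nothing to do with how ssdeep scores files.

```python
import zlib


def same_byte_percent(a: bytes, b: bytes) -> float:
    """Percentage of positions holding the same byte, relative to the longer input."""
    if not a and not b:
        return 100.0
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / max(len(a), len(b))


# Made-up "pixel data": 200 rows standing in for an uncompressed image.
raw_original = b"".join(
    b"row %03d: grey grey grey white white grey\n" % i for i in range(200)
)
# A small local edit near the top, similar in spirit to touching up one region.
raw_modified = raw_original.replace(b"row 005: grey", b"row 005: BLUR", 1)

compressed_original = zlib.compress(raw_original)
compressed_modified = zlib.compress(raw_modified)

print("raw bytes in common:        %.1f%%" % same_byte_percent(raw_original, raw_modified))
print("compressed bytes in common: %.1f%%" % same_byte_percent(compressed_original, compressed_modified))
```

The raw data differs in only four bytes. After compression the two streams usually line up far less well, because a small edit can change the encoding of everything that follows it. The exact figures depend on the data and the zlib build, so none are quoted here.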
Now we can see a few things. Just like with the text files, JPEG 1 matches JPEG 1, and we expected that. But we don't see JPEG 1 anywhere else, so it looks like JPEG 1 is not matching anything else. And remember, this was the original file, the file that everything else was built from. JPEG 2 is basically the same thing: it matches JPEG 2, but we don't see it anywhere else. So it looks like my JPEG hypothesis is working out. The JPEGs don't match each other because they were modified and, most likely because of the compression in JPEG, the files don't match at the binary level, because they basically get recompressed.

PNG 1 is basically a direct copy of JPEG 1, and it matches PNG 1. Now, PNG 1 also matches PNG 2 at 41%, so let's look at that real quick. We have PNG 1, which is the cat with basically no modifications, and PNG 2, the cat with a blurry face. The only thing we did is blur this out on this layer and then save, so they both have the same compression level. The only thing that changed was the blurry area, and apparently, with the compression that's used, enough is left unchanged to match at up to 41%. So that's a 41% match with some modifications, and that's because of the file format and the way that PNG compression works, or the way PNG compression in GIMP works.

PNG 1 matches PNG 4 at 55%, so let's take a look at that. Why is that? Well, basically for the same reason. Instead of blurring all of this space, we've actually only modified a very small portion of the data.
Let me open up PNG 2 again. We can see that the area that was modified there is, let's say, this big, if you can see that, whereas the area modified in PNG 4 is much, much smaller. Even though it's smaller, it still only matches 55%, again because of the type of compression that's done. We're still losing some information there. So those are the matches for PNG 4 and PNG 1.

Now you might notice something on here. We matched PNG 2 and we matched PNG 4, at 41% and 55%, because of the amount of data that was actually changed from the original image. But what we don't see on here is PNG 3. So let's look at PNG 3. This one basically just has this box added, so why doesn't it match? Well, it doesn't match because when I added this box in GIMP, I actually added a new layer to the image. When I added a new layer to the image, it essentially changed all of the data, because the new layer sits on top of the entire image, and that changes all of the data underneath when it finally gets rendered. So PNG 3 doesn't match, even though the box is the only visible change, because of the way that I added the layer. On the other ones I just modified the layer directly. I did not add another layer on top.

That's important to think about: the way the editing is done makes a huge difference in whether you're going to get any type of match or not. Really, if you're trying to compare two images, you should be using feature extraction and not binary matching. I'm just showing you that it is possible, but there are some things you have to watch out for.

Let's see the rest. PNG 2 matches itself. PNG 3 matches itself at 100%. PNG 4 and PNG 1, just like we saw before, 55%. PNG 4 and PNG 2, 40%.
And then PNG 5 with no compression matches PNG 5 with no compression, and PNG 6 with no compression matches PNG 6 with no compression. And they don't match each other either. That's basically just from changing the compression levels around. Now if we go back and add -a, we can see all of the ones that didn't match as well. PNG 5, no compression: everything else basically matches at zero, so it's automatically filtered out.

Now, we can also use ssdeep to show only matches that are above, let's say, 50% or 60%. But the problem here is that PNG 2 and PNG 4 actually matched. They are similar because they're derived from the same image, but they only match at 40%. There are a lot of different reasons for that low score, but this is actually a match. So my point is, be very careful about how you choose to filter. If you decide to take everything that's a 50% match or above, that might not be very good, depending on what types of files you're working with. With some data types, especially text, you usually want a higher threshold, for example 80% or 90%, something like that. But really you have to understand your data set and what you're trying to match before you can figure out where to set your threshold. So just be aware of that. Don't just arbitrarily pick 60 and above, because you would have missed these two. But if you pick anything lower than 50 you might get a lot of false positives. You really need to understand the data that you're trying to match here.

So again, like I said, when we're matching images you should be using feature extraction, not binary matching. Just know your data types. If you are looking at data types that you know can match at a binary level, then no problem. But for images or video you really should be doing feature extraction and similarity matching over the features.
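The threshold trade-off can be made concrete with a short Python sketch. The scores are the ones quoted in this walkthrough; the pair labels are shorthand rather than the real file names, and filter_matches is a hypothetical helper, not part of ssdeep.

```python
# Scores quoted in the walkthrough; labels are shorthand, not real file names.
matches = [
    ("email 26", "email 26 mod", 99),
    ("PNG 1", "PNG 2", 41),
    ("PNG 1", "PNG 4", 55),
    ("PNG 2", "PNG 4", 40),
]


def filter_matches(pairs, threshold):
    """Keep only the pairs whose similarity score is at or above the threshold."""
    return [pair for pair in pairs if pair[2] >= threshold]


for threshold in (80, 50, 40):
    kept = filter_matches(matches, threshold)
    missed = [pair for pair in matches if pair not in kept]
    print(f"threshold {threshold}: kept {len(kept)}, missed {len(missed)}")
    for first, second, score in missed:
        print(f"    missed {first} vs {second} ({score}%)")
```

With these numbers, a cutoff of 80 keeps only the text pair, a cutoff of 50 still drops two genuinely related image pairs, and a cutoff of 40 keeps all four. On a larger data set a cutoff that low may also let in unrelated files.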
Right, so let's see. The only other thing I have to show is back in the text folder. We already know that 26 and 26 mod are related. We found that out by using ssdeep. But I want to know exactly what was different between the files. So I can use diff in Linux, and just give it 26 and 26 mod, and this will tell me exactly the lines that were different. 26 is the original and 26 mod is the modified, and the original goes first. So in the original we have "inappropriate materials such as this are opened". In the new one we have "inappropriate cookies such as this are opened". So using diff is a good way to see exactly what line was changed once you find out that two files are similar.

Anyway, that's just to show you how to find what was changed between two different files. So that's it for today on fuzzy hashing and ssdeep. Thank you very much. If you like this video, please subscribe for more.
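The diff step can also be reproduced in Python with the standard difflib module, as a rough equivalent of the Linux diff command rather than the command itself. Only the changed sentence comes from the walkthrough. The surrounding lines and the file labels passed to unified_diff are invented for the example.

```python
import difflib

# Only the middle line reflects the change described above; the rest is invented.
original_lines = [
    "Subject: reminder\n",
    "Inappropriate materials such as this are opened.\n",
    "Thanks.\n",
]
modified_lines = [
    "Subject: reminder\n",
    "Inappropriate cookies such as this are opened.\n",
    "Thanks.\n",
]

# unified_diff yields header lines, then '-' for removed and '+' for added lines.
diff_lines = list(
    difflib.unified_diff(original_lines, modified_lines, fromfile="26", tofile="26_mod")
)
print("".join(diff_lines), end="")
```

As with diff, the original goes first, so lines starting with a minus sign come from the original and lines starting with a plus sign come from the modified version.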