Our next speaker will be talking about Scrubber, an open source tool to protect journalistic sources. Please give a warm welcome to Ethan Gregory Dodge.

Thank you, thank you. I am super excited to be here at the Crypto and Privacy Village. It's one of my very favorite villages, and this is my first time speaking at it. The other day I told a co-worker that I was speaking at the crypto village and he said, "Do you know anything about crypto?" So I'm not talking about crypto today; I'm talking about privacy. Let's jump into it, because I've only got 20 minutes here.

I'm Ethan Gregory Dodge, born and raised Mormon. That has nothing to do with this topic, it's left over from another slide deck, but I'm a digital forensics professional and I am a journalist. I'm the co-founder of the Truth and Transparency Foundation, which was originally called MormonLeaks, which is where my Mormonism comes into play. We are essentially a nonprofit investigative newsroom dedicated to empowering the disenfranchised by promoting transparency within religious institutions. So we get documents from anonymous sources all the time, and I'll tell that story in a second. Actually, let's just jump into it.

Does anyone know who this is a picture of? Who is it? Reality Winner. And what happened to Reality Winner? Right. Essentially what happened was she leaked a document to The Intercept. She printed it out, and the printer had left a fingerprint, a watermark, that was nearly invisible. She scanned the document in and sent it to The Intercept, and The Intercept didn't know about the watermark. That fingerprint revealed the printer and the time she had printed it, the NSA was able to go to the cameras and see who was there at that time, and she got caught. She is now serving time in federal prison.

So this is an incredibly important topic. As a journalist and as the technical director of the Truth and Transparency Foundation, anything technical or security related falls into my bracket, and I was spending a lot of time cleaning, compressing, and optimizing PDF documents. I would script it out little by little. We launched in 2016, so we're coming up on three years of doing this, and it was extremely tedious. Even once I scripted it out, it seemed like I kept finding brand new attack vectors that I then had to account for. If you look back at our history, we released a ton of documents related to the Jehovah's Witnesses, and the source literally just gave me a hard drive of documents; this is all public info because he has admitted to the press that he did it. It was tens of thousands of PDFs, and I was like, fuck, I need to automate this, I'm not doing this by hand. That's where Scrubber came into play. It leverages several tools, PDF Redact Tools, OCRmyPDF, and QPDF, to clean, OCR, and linearize PDFs.
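At a high level, the chain looks something like this. This is a rough sketch using each tool's stock command-line interface, so the exact flags and intermediate file names in scrubber.sh may differ:

```bash
#!/bin/bash
# Sketch of the clean -> OCR -> linearize chain that Scrubber automates.
set -euo pipefail
input="$1"   # the PDF as received from the source

# 1. Rasterize every page and rebuild the PDF, dropping metadata, watermarks,
#    and embedded objects (PDF Redact Tools).
pdf-redact-tools --sanitize "$input"           # e.g. writes ${input%.pdf}-final.pdf

# 2. Put the searchable text layer back and compress (OCRmyPDF wraps Tesseract).
ocrmypdf "${input%.pdf}-final.pdf" ocr.pdf

# 3. Linearize ("fast web view") so a browser can load one page at a time.
qpdf --linearize ocr.pdf cleaned.pdf
```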
The actual code that cleans the metadata, watermarks, and so on out of the PDF is in PDF Redact Tools. OCRmyPDF I'm just using to put the text layer back, and I'll go through exactly why the text layer needs to be put back. Then QPDF linearizes it. Linearizing PDFs is important if you're going to serve them over the web, because it allows the browser to load one page at a time as you scroll, rather than loading the entire PDF before you can view anything. We've released PDFs that are hundreds of pages, and after this process the PDF does get a little larger than a typical file, so linearizing is important for us. Essentially, Scrubber is just a bash script that ties all of these tools together and automates the process I'd been doing manually for quite a while.

First it turns every single page into a PNG using PDF Redact Tools, which was developed by Micah Lee of The Intercept. In the Reality Winner case, they could have gotten past the watermark by converting the PDF to a pure black and white image, because the watermark was so light that it would have registered as white, and PDF Redact Tools has an option to do exactly that. You can also pass it a flag so that it stops in the middle, after it has turned the pages into images; you can then open them in GIMP, redact whatever you need to, and run it again to merge all the pages. It combines them back into a PDF, effectively stripping any embedded image data. If you go to Google Docs or Microsoft Word, put a JPEG or a PNG into a document, and download it as a PDF, someone can extract that image from the PDF and potentially get metadata that is identifying and incriminating. What Scrubber does is flatten the entire thing into one image per page, which gets rid of that attack vector.
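For reference, that manual-redaction detour looks roughly like this with PDF Redact Tools on its own; these are its documented flags, though output names can differ between versions:

```bash
# Turn every page into a PNG and stop, so the pages can be edited by hand.
pdf-redact-tools --explode source.pdf     # e.g. writes PNGs into source_pages/

# (Open the exploded PNGs in GIMP and black out anything sensitive.)

# Recombine the edited pages into a flattened PDF with no embedded objects.
pdf-redact-tools --merge source.pdf       # e.g. produces source-final.pdf

# Or do it in one pass; --achromatic additionally converts the pages to pure
# black and white, which is what defeats faint printer tracking dots.
pdf-redact-tools --sanitize --achromatic source.pdf
```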
Because we turn the document into images and then make it into a PDF again, we have to add the OCR layer back. The OCR layer is what allows you to copy and paste from a PDF; it's the text layer. OCRmyPDF leverages Tesseract, and it's an amazing tool. It's funny, because Scrubber is really a wrapper around several other wrappers around several other tools, so we've got wrappers on wrappers on wrappers here. OCRmyPDF also handles compression, which is super nice, and makes sure you get the smallest PDF size you can while keeping pretty decent quality. Sometimes you have to sacrifice quality for size, but that's the name of the game; the documents are always still readable, and that's the important thing. Then QPDF linearizes it for optimal web viewing; like I said, it lets your browser load one page at a time rather than the entire PDF. The last step isn't strictly necessary, but because I'm OCD I still remove all of the PDF metadata using ExifTool. You don't really have to do that, because the metadata that's left at that point is generic and comes from the tools rather than from the source.

This is the best part about it, in my opinion: I've dockerized it, and because of that you can run it on any operating system. I'm going to do a demo here in a minute, but essentially you can clone the repo and run the script, and it will pull the Docker image down from Docker Hub. Or, if you don't want to risk your operational security by pulling from Docker Hub, you can build the image locally and it will run that instead. Building locally does take forever, because it's a huge-ass image, about one and a half gigs. I'm going to work on decreasing that, but that's where it is now; it takes anywhere from 10 to 30 minutes depending on how fast your network connection is and your CPU. So essentially you run a script, scrubber.sh. It either pulls the Docker image down from Docker Hub or builds it locally, starts a Docker container, runs PDF Redact Tools, OCRmyPDF, QPDF, and ExifTool, and then writes the result out to an output file.

The benefits of Scrubber: you can run it on any operating system thanks to Docker, even Windows. Caveat: I've never used Docker on Windows, so I don't know how hard it is to get installed, but from what I understand it's pretty simple. It can also handle large PDFs. PDF Redact Tools actually leverages ImageMagick to turn every single page into an image, and with large PDFs that can eat up quite a bit of memory. I have it configured to use up to four gigs of memory; you can change that if you need to, but you'd have to rebuild the Docker image locally. I have 32 gigs of RAM on my laptop, so it's not a huge deal for me. The reason I set the limit that high is that, like I said, we've released PDFs that are hundreds of pages, and turning hundreds of PDF pages into images takes a shit ton of memory.
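In practice, getting it running looks roughly like this. The exact commands are in the README, and the image name in the build step below is a placeholder rather than the real one:

```bash
# Clone the repo and run the wrapper script; by default it pulls the prebuilt
# image from Docker Hub and does all of the work inside a container.
git clone https://github.com/truthandtransparency/scrubber.git
cd scrubber
./scrubber.sh /full/path/to/document.pdf   # full paths only; relative paths aren't supported yet

# If pulling a prebuilt image is an operational-security concern, build the
# roughly 1.5 GB image locally first (expect 10 to 30 minutes).
docker build -t scrubber .                 # image tag and Dockerfile location are assumptions
```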
And it provides just a one-stop tool for cleaning and optimization; that was the biggest problem before, right? I could use this tool to clean it, then I had to use that tool to optimize it, and then another tool for something else. For a long time I didn't even know what OCR was, to be completely honest with you, and I kept wondering why the text layer wasn't in there; I just had a naive understanding of PDFs. Anyway, I really like where this is at, because it takes the PDF from the state we got it in from the source, cleans it, and then optimizes it for publishing on the web. It also supports batches of PDFs: if you give it a directory of PDFs, it will go through and clean and optimize every single one. That's the URL where it lives: github.com/truthandtransparency/scrubber.

Let's do a quick demo. I recorded the demo, so don't worry, and you're going to see my cat in it. Essentially there is a PDF with an embedded JPEG of my cat Lisa. Sorry, you can't really see what I'm doing here, but when I put it on YouTube you'll be able to see it a whole lot easier. I'm extracting that image of my cat and looking at the exif data that was in it, and you can see that it extracted that single image and also extracted the Truth and Transparency logo. Now I'm going to actually run Scrubber on it. One thing to notice here, and you can't really see it, is that I'm passing it the full file path, and that's a mistake on my part: it doesn't support relative file paths right now. That is the first thing I'm going to fix; in my rush to get this out I totally forgot to take it into account, and don't worry, I already have an issue open on GitHub to take care of it. So now it's going through, and you can see it says running PDF Redact Tools, adding the text layer with OCRmyPDF, optimizing it for publication, and removing the exif data, and then it says your final output is here. Right here I'm going to try to extract the images again, if I can actually type. It runs the command, and it did extract an image, but you'll see that the image it extracted is just the entire flattened page; it wasn't able to pull the embedded photo out. Then that's the actual clean version, and that's me highlighting the OCR layer. Let me back up a little bit: you can kind of see in my cat's ears that the quality was compromised a little, but it did its job. In the case of journalistic source material, you really just need the text to convey the point.
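If you want to run that same check yourself, it's roughly these two standard commands, Poppler's pdfimages and ExifTool; the file names here are just placeholders:

```bash
# Before scrubbing: pull any embedded images out of the PDF in their native
# formats, then read whatever metadata they carry.
pdfimages -all original.pdf extracted     # writes extracted-000.jpg, extracted-001.png, ...
exiftool extracted-000.jpg                # camera model, timestamps, GPS, and so on

# After scrubbing: the only images left are the flattened full-page renders,
# so there is no embedded photo (or photo metadata) to recover.
pdfimages -all cleaned.pdf check
exiftool cleaned.pdf                      # the PDF-level metadata is generic at this point
```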
That's all I really had. I'm seven minutes early, but I will take as many questions as y'all have got; if not, that's fine too. All right, thank you, thank you. She's got a mic coming for you; do you mind coming over here?

Sorry, I was just asking if you could put the contact slide back up real quick.

Yeah, my contact slide, okay, for sure. Where'd it go? As you'll see down here, I actually have the roadmap listed, and these are the things I want to fix; the number one issue is accounting for relative file paths. Let me pull the contact screen back up. There it is.

Hello. Can you talk a little bit about what you do for other file types? Like, if you get a PPT, do you try to convert everything to PDF and then run this?

Yeah. Typically, as a rule of thumb, even if I get a JPEG or some other image file, or, like you said, a PowerPoint presentation, I'll convert it to PDF as best I can without affecting the contents of the file. Sometimes that can get a little difficult, and we have actually released the raw PPT, but the reason we avoid that is that it's way harder to scrub an Office document, because Microsoft is terrible. And to be fair, PDF is an awful file format as well. The huge disclaimer, which I have in the README, is that this is not a silver bullet by any means. Do not expect this to completely protect your sources one hundred percent; there's always a possibility that they're going to be identified. But this is definitely a really, really great place to start.

Do you have any plans on getting it included in the Debian upstream repos, so that it could be used in something like Tails?

That would be awesome. I haven't thought about that, but I know who to talk to to see if that can happen, so that's a good idea. Thank you.

Great presentation, and thanks for making this tool.

Thank you.

I could imagine that one response to the proliferation of tools like this will just be slightly more insidious watermarking, patterns in images and things like that. So do you plan to support an option, maybe call it paranoid mode, that basically converts everything to black and white to prevent encoding in images, removes images, and just tries to keep the text?

That's a good idea, and it would be really easy to do. That's a great idea, actually. I can definitely do that, and if you want to help me I would definitely welcome the help; that's why I open sourced it, so feel free to submit pull requests. But yeah, great idea, I'm definitely going to think about it. Thank you.

It's a really cool tool, but I wonder: in scrubbing the file, don't you incidentally also remove the ability to verify that it's real?

Yeah. I mean, that's the job of a journalist, right, to verify its authenticity. I guess it comes down to whether you trust that journalist; there are definitely people who claim to be journalists that I wouldn't trust right away. This is something we get asked about all the time, how we verify that the documents are authentic, and that's not something we talk about. It's the job of the journalist, and it's a question of whether or not you trust them. Great question, though.

Hey, can you put that GitHub link back up again?

Yep, there it is.

Oh, thank you.

And really quick, back to your other question: it's also the journalist's responsibility to, what's the word I'm looking for, inform their readers that they are positive it's authentic, and, if they can do it without risking sources, to disclose how. There have been instances in which we disclosed how we verified something because it wasn't a risk to our sources, and journalists should do that where they can, in my opinion. Again, I'll caveat that with the fact that I'm not a professional journalist; I do this on the side.
But I try really, really hard to do things at a professional caliber. But yeah, thank you. All right, any more questions? All right, thank you. Thank you.