Our next talk is about the deconstruction of academic paywalls. I'm sure everybody here knows and has seen those specific kinds of paywalls. And as you also might have noticed, academic paywalls differentiate themselves, unfortunately, from other paywalls on the web not by making themselves transparent for the disadvantaged, as one might expect given the importance of the proliferation of knowledge, but rather by even more exploitative pricing, which is why I'm very excited for Storm Harding and his talk, Jumping the Paywall. Thank you so much. So, welcome. Thank you, guys. Thank you for coming. Hi, everyone. I'm Storm. I'm from the Internet. Today I'm here to talk to you about paywalls. You're probably familiar with the concept: a paywall is, very basically, when we have some piece of knowledge, some information, and we cannot get to it without paying. Today we'll be talking about how we jump those paywalls. We're going to break it down into the theoretical approaches that academics have taken to this problem, then we're going to talk about practical solutions for how to extract content from paywalls, and then how to remove any potential traitor tracing or watermarking from that content. Before we start, we of course have this disclaimer: any particular tenses I may use, any particular tenses you may hear, are not indicative of any kind of injunction to action. We're all operating in a purely imaginative framework. So, with that in mind, let's start with the theoretical overtures, how academics have grappled with the problem of paywalls. This guy, Gary Hall, in 2009 proposed what we'll term a legalist approach to dealing with paywalls.
We can see some of the approaches that he proposed, such as asking for permission to share an article after it's been published, or adopting a don't-ask-don't-tell policy where, if you publish an article in a paid journal, you could then kind of on the sly put it online yourself. This last point is particularly interesting to me, and it's what we're going to be focusing on throughout the talk: if we can put articles online for free, why should that permissibility be restricted to the figure of the author? Or, in other words, why do we need to repeat the gestures of the legalist? Why do we need to engage in legal discourse when we talk about ethical acts? Later on, Striphas and McLeod developed this position that any strategy for contesting the law, as you see up here, should proceed through more than just legal channels. And that's what we'll be exploring today: extra-legal, or non-legalist, modes of intervention in the copyfight. And I'm assuming most of you are familiar with Aaron Swartz and his Guerilla Open Access Manifesto, which we'll be more or less adhering to today. A final note: this is not a talk in defense of copyleft, as these talks still often are. Instead, today, when we jump the paywall, we view copyleft as, in fact, a much more malignant enemy than traditional copyright. The reason is that copyleft presents a sense of acceptability; it makes copyright palatable. If this seems contradictory, the problem is that copyleft does not question intellectual property itself. It merely changes the directive from "thou shalt not", the traditional copyright injunction, to "thou shalt", the full permissibility of copyleft. What it doesn't question is: why should anyone get to dictate that "thou" in the first place? So what we're going to be doing today is challenging the notion of intellectual property at its core. Copyleft does not do that.
In fact, copyleft entrenches it all the further. And finally, there are things like Open Access, which are, again, fundamentally reformist: nonsensical injunctions to, again, propagate intellectual property. Some of you may be familiar with PLOS, the Public Library of Science. On the one hand, they claim that they stand for unrestricted access and unrestricted reuse, but in the next paragraph of their mission statement they say that they apply the Creative Commons Attribution License. Licensing, of course, is inherently a restriction. The particular terms of the license do not matter; what matters is that someone feels they have the right to set a license in the first place, and that is what we're fighting against today. And thus we reject copyleft, we reject Open Access, and we embrace the copyfight. So that was the theoretical standpoint that I'll be coming from. Now let's talk about how we actually liberate content. Here's a set of rules to follow. The first is: always be pirating. Always steal books from the library. Never check those books out. The question is why, right? We've all been brought up to be good citizens who borrow books from the library, return them on time, and pay our fines. What this does is create a convenient tracking database, which can then, of course, be correlated to your online distribution activities. So let's say you're fond of a particular forensics journal that a local community college library has, and you always borrow and return it on time, but while you're doing that, you also scan a copy and post it somewhere else. Let's say Elsevier, one of the owners of these journals, then decides to start checking library records: who checked out this particular journal, and these particular issues, which then went online? Now, of course, you may be thinking that other sites may be keeping records, too.
The difference is that your library record, unless you took the precaution of using false identification to register, is linked to your real identity and can lead to source neutralization. That's one of the main problems that we'll be talking about today through particular case studies: source neutralization, which is when the adversary neutralizes the source of content, in other words shuts them down, by lawsuit, by arrest, sometimes through grimmer circumstances such as suicide. So like I said: don't use the library unless you have to, and if you do have to, always steal from the library. Moving on to alternative digital vectors, there is of course Library Genesis. The current mirror is .io, though, as you may know, the actual URL bounces around in a round-robin sequence, so it may change to .is, .org, and so on. So exactly how big is Library Genesis? The most recent studies show that it contains 38 million academic articles totaling 28 terabytes, which constitutes 36% of all indexed academic articles that have a DOI, or digital object identifier, which is kind of like an ISBN for journal articles. So that's one of our main resources. A related resource is Sci-Hub. Sci-Hub is a round-robin sequence of .edu and .ac.uk proxies, which you feed an article's DOI or another URI like the URL, and then you can go and get the article. Basically, Sci-Hub does what we've been doing throughout the 90s, where you go and find public EDU proxies that have access to particular journal subscriptions, and then you write a basic scraper to collect all the content and distribute it. Sci-Hub automates this task by automatically mirroring any article you access into the LibGen archives. So these are the two main resources that we should use in lieu of dangerous physical meatspace libraries. There has also been a growth lately in crowdsourced resources.
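To make the DOI-based lookup concrete, here is a minimal Python sketch. The DOI pattern is a loose approximation of the real identifier grammar, not the full specification, and the mirror domain is a deliberate placeholder, since, as noted, the actual mirror domains rotate.

```python
import re

# A DOI ("digital object identifier") looks like 10.<registrant>/<suffix>,
# e.g. 10.1000/182.  This pattern is a rough sanity check, not the full grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def is_plausible_doi(doi: str) -> bool:
    """Loose check that a string looks like a DOI."""
    return bool(DOI_PATTERN.match(doi))

def resolver_url(doi: str) -> str:
    """The official DOI resolver, which redirects to the (often paywalled)
    publisher landing page."""
    if not is_plausible_doi(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return f"https://doi.org/{doi}"

def mirror_url(doi: str, mirror: str = "mirror.example") -> str:
    """A Sci-Hub-style lookup is just a mirror domain plus the DOI.  The
    default domain here is a placeholder: the real ones rotate."""
    if not is_plausible_doi(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return f"https://{mirror}/{doi}"

print(resolver_url("10.1000/182"))
```

The same DOI string works against any of the crowdsourced request channels mentioned below, which is what makes it the common currency of all these resources.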
On Reddit, the r/Scholar subreddit has a request section and a fulfillment section where you can post the DOI that you want and someone can find it. On Twitter there's also the recent hashtag #icanhazpdf, where you make a tweet with, again, the DOI or the URL of the article you want, and anyone who has access will send you the link. And there are a couple more, more or less obvious, resources that we nonetheless should not overlook, such as Google Scholar, which oftentimes leads to open versions of articles that are otherwise paywalled on other mirrors. And you should also always check the personal pages of any particular author, because sometimes they put articles online there. And again, going back to the dreaded library: if everything else fails and you have access to a university, go and try to procure the article from there, but be sure to use open-login terminals. Some of these may be non-obvious. For instance, if you're faced with just a basic catalog screen, try tapping something like the Windows key and then right-clicking to go to the desktop, and you can basically escalate privileges to obtain access, since registration isn't required to view articles in an educational setting. And now we should talk about the last resort. Let's say we can't find what we're looking for online, and we actually have to trek out to a local library, or at least a Wi-Fi hotspot, that has EDU access. We need to practice good operational security when we're actually on adversarial territory. Some of you may be familiar with the case of Aaron Swartz. If not: Aaron was essentially arrested for downloading a few million articles from JSTOR, a particularly popular academic database, from a server rack at MIT. What Aaron did was go to a particular server closet over and over again, plug in his own hard drives, and liberate a few million articles. What led to Aaron's arrest was that he went back to the same closet.
In other words, the admins noticed regular activity coming out of this random server rack, and then they set up CCTV surveillance in that space. The first rule, as always, is to never return to the same feeding hole: always pick a different source if you're practicing actual opsec in the vicinity. Another particular item to keep in mind is: do not create any record of your existence. If the particular facility that you're accessing requires swipe-through or smart card access, you can try to social-engineer your way into the facility by, for instance, taping electrical tape over the mag stripe and then complaining to a security guard that your card just doesn't seem to work; more often than not, they'll just swipe you in. So by this point, we should have procured some articles, right? Before we actually start sharing them, we need to engage in content defanging, or removing all of the actual poison that the venomous publishers inject into these articles, before we can share them, to, again, prevent source neutralization. There are three basic types of so-called bad things that publishers can put in articles: content protection, watermarks, and metadata. Let's go through how we can potentially deal with each one. Let's start with content protection. Content protection is, very basically, stuff that prevents you from doing things with the article: things like copying it, printing it, or reading it. And there are very many easy-to-use tools that we can deploy to defeat content protection. One such tool is Advanced PDF Password Recovery Pro, which can also brute-force passwords to PDFs if they're not just content-protected but also password-protected. And again, this works for very basic protection. Then there are more advanced protections, such as Adobe's more recent LiveCycle program, which requires connecting to a server in order to get a temporary certificate to view the article.
What we can, in fact, do there is spoof the server to localhost. I'm not going to go through that, because very detailed guides already exist; the point to take away is that this is very easily done if you just look up how to do it. So that was content protection. That's usually the overt form: it's very obvious when you cannot copy a particular article. The much more latent form is watermarks. Watermarks function by being embedded into the actual article itself, and these can be things like marginal notes, where an article says it was downloaded from wherever, at whatever time, from whatever IP. So this is the first kind of watermark that we're looking at. And again, this is relatively straightforward: these are things you can see in the marginalia of an article, so let's get rid of them. A basic tool to use, at first glance, would be BRISS. BRISS is a cropping tool where you input a PDF, select the margins, and crop out the potential watermark. So here we have a censored article: on the left, before BRISS, with the ostensible watermark marginalia on the left-hand side, and then after, seemingly without it. The problem, though, is that BRISS performs what is known as a non-destructive crop. This means all it does is adjust the margins; it does not delete the content. So if you naively download BRISS and crop out the margins, forensic examiners will still be able to retrieve the watermark, which will be outside the printable margins but will still be embedded in the PDF. What is necessary instead is to entirely reprint the article, not simply recrop it, selecting the margins within the printer parameters. Once we do that, we find that actually printing it gets rid of the marginal watermark better than BRISS does.
So that was the very basic kind of watermarking that we can encounter. There are other, much sneakier watermarks that publishers may potentially put in. The first is known as natural language watermarking, or NLW. The way NLW functions is that instead of adding extraneous information to the article, it modifies the actual syntax of the article itself. A very basic example, which you see up here, would be one iteration of an article saying "I ate a green cupcake yesterday", another saying "yesterday I ate a cupcake that was green", or "yesterday I ate a green cupcake", and so on and so forth. Once any given number of sentences are modified, the tracking algorithm can then deduce which version of the article was watermarked, or which source it came from. The flip side of this is that it is very trivial to defeat: we simply perform a difference analysis between two copies of the article. There's then a potential third kind of watermarking that we should be conscious of as well, which is spatial watermarking. This functions by modifying the actual spacing: between sentences, between words within a sentence, between lines, between pages, between page numbers, and so on. And again, the good thing is that once we get rid of the content protection, this is very trivial to remove by dumping most of the article into plain text, which gets rid of the particular spacing minutiae preserved in PDF files. Finally, the third kind of component that publishers often use to track you is metadata. Metadata is, again, basically data about data. In our case, it's things like who the article's author is, the time that it was generated, the particular time zone it was generated in, and the mysterious UUID field, which we'll talk about in a second. Now, if you're using something like Adobe Acrobat, it ostensibly has a metadata scrubbing tool built in.
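Both removal techniques just described, difference analysis for NLW and plain-text dumping for spatial watermarks, can be sketched in a few lines of Python. This assumes you have already extracted the text of two copies of the same article; the sentence splitter here is a crude approximation, and real text extraction would need more care.

```python
import difflib
import re

def _sentences(text: str) -> list[str]:
    """Crude sentence splitter: break after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def sentence_diffs(copy_a: str, copy_b: str) -> list[tuple[str, str]]:
    """Compare two copies of the same article sentence by sentence and
    return the pairs that differ -- candidate NLW watermark sites."""
    a, b = _sentences(copy_a), _sentences(copy_b)
    matcher = difflib.SequenceMatcher(a=a, b=b)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs

def normalize_spacing(text: str) -> str:
    """Collapse every run of whitespace to a single space, destroying any
    spatial watermark carried in spacing minutiae."""
    return re.sub(r"\s+", " ", text).strip()

a = "Yesterday I ate a green cupcake. It was good."
b = "I ate a green cupcake yesterday. It was good."
print(sentence_diffs(a, b))  # only the reordered first sentence differs
```

The design point is that both attacks reduce to text comparison once the content protection is stripped: the watermark's carrier (word order, spacing) survives only as long as you keep the publisher's exact byte stream.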
And here on the screenshot, you can see they claim this will discard document information and metadata. This is what is known as a lie. If we open the metadata of a PDF that has been scrubbed with Acrobat's own scrubber, we find that it still has UUID parameters, which we can view if we open it in a basic hex editor. These are, again, on our list of bad things. And remember, our goal is to get rid of the bad things so we can share the good thing, which is the knowledge. So what is this particular unique ID? Adobe's XMP specifications, which lay out the metadata that Acrobat uses, don't actually conveniently tell us: they say that it's up to the printer. The PDF printer can set its own UUID parameters. Best practices in the relevant RFC specs dictate that UUIDs should be at least partially randomly generated, but earlier UUID versions used the MAC address. In fact, this is how the author of the Melissa virus eventually got caught: the UUID in the Word documents used to spread the virus matched some other random files that someone had uploaded online at one point, and that someone turned out to be a friend of the guy who wrote the virus. In other words, the UUID is dangerous. Adobe's specifications do not dictate that you use the latest UUID implementation, which is randomly generated. So in theory, any PDF printer that you use could be producing UUIDs that will, again, allow traitor tracing. In other words, they need to be taken care of. If you're editing your document in something like Adobe Acrobat, they will not be taken care of, even if you select the scrub tool, which means that you need to go into the document with a hex viewer and actually remove or modify the parameters there. And, of course, this was all about modifying potential metadata: we would open it in a hex editor and change, say, the time zone.
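The hex-editor approach described here can also be approximated programmatically. The Python sketch below first shows the version-1 versus version-4 distinction with the standard uuid module, then rewrites uuid: strings in a PDF's raw bytes with fresh random ones. The regex is an approximation of how UUIDs commonly appear in XMP packets; real files vary. The replacement deliberately keeps the same byte length, since PDF cross-reference tables store byte offsets, and changing lengths in place can corrupt the file.

```python
import re
import uuid

# Version-1 UUIDs embed the machine's node (typically the MAC address) in the
# final field; version-4 UUIDs are random.  This is the trap described above.
v1 = uuid.uuid1()  # node-derived: potentially traceable
v4 = uuid.uuid4()  # random: safer
assert v1.version == 1 and v4.version == 4

# How a hex UUID typically appears after a "uuid:" prefix in XMP metadata.
# An approximation: real packets vary in layout and attribute names.
UUID_BYTES = re.compile(
    rb"uuid:[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    rb"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def respoof_uuids(pdf_bytes: bytes) -> bytes:
    """Replace every embedded uuid:... with a fresh random (version-4) UUID,
    keeping the length identical so PDF byte offsets stay valid."""
    return UUID_BYTES.sub(
        lambda m: b"uuid:" + str(uuid.uuid4()).encode("ascii"), pdf_bytes
    )

sample = b"<xmpMM:DocumentID>uuid:12345678-abcd-4ef0-9876-0123456789ab</xmpMM:DocumentID>"
cleaned = respoof_uuids(sample)
assert cleaned != sample and len(cleaned) == len(sample)
```

Replacing the value with another plausible UUID, rather than blanking it, matches the spoof-over-wipe principle discussed next: the field still looks populated to an examiner.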
If we simply wanted to wipe the data, in other words not insert spoofed data but simply erase it all, we could use a very easy tool known as the Metadata Anonymisation Toolkit, where you feed it an input and it produces a cleaned output. The problem with simple wiping, of course, is that your adversary will then know that the data was erased; in other words, they will know that you were privy to the modifications. So if you have the time to go in and start modifying values instead of just erasing them, that will potentially lead the adversary down a wild goose chase. So at this point, we've found sources where we can procure articles, and we've discussed how to remove protection from those articles. Now, how can we finally share them? The first and very fundamental principle is not to use your own IP, not to use the IP of any university you may be affiliated with, and to use Tor. But of course, don't use Tor from your university network, because then it would be obvious if you were the only person using Tor at a given time and there's a Tor upload matching that timestamp. In other words, don't use your own network at all. The second thing to do is to wait. Let's say you purchase a book from Amazon at five o'clock on Friday, then put it on LibGen at five o'clock on Friday, and you do this over and over again. Amazon may then very easily conduct time correlation attacks, because LibGen, of course, preserves the file upload date and time. So the second thing to do, other than not using your own network, is to wait before you upload, in order to defeat time correlation attacks. You may further be able to confound these by, again, modifying the metadata within your document: if you downloaded something on Wednesday the fifth, you could change the metadata to say you downloaded it on Tuesday the fourth, or even earlier, or potentially even later.
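The spoof-rather-than-wipe idea for dates can be sketched as follows, in Python. PDF date fields such as CreationDate and ModDate use a D:YYYYMMDDHHmmSS form followed by a time-zone suffix; the sketch rewrites only the 14-digit portion and leaves the suffix alone, so the field length, and thus the file's byte offsets, is preserved. A real file may carry further timestamps in XMP that this simple pattern would miss.

```python
import re

# PDF timestamps look like D:YYYYMMDDHHmmSS plus a time-zone suffix,
# e.g. D:20151228170501+01'00'.

def spoof_dates(pdf_bytes: bytes, fake: bytes = b"20150101000000") -> bytes:
    """Overwrite the 14-digit portion of every PDF-style date with a fixed,
    plausible fake, keeping the time-zone suffix and byte length intact so
    cross-reference offsets stay valid.  Spoofing (rather than blanking)
    means the file does not advertise that it was scrubbed."""
    assert len(fake) == 14, "replacement must keep the field length"
    return re.sub(rb"D:\d{14}", b"D:" + fake, pdf_bytes)

sample = b"/CreationDate (D:20151228170501+01'00') /ModDate (D:20151229083000Z)"
print(spoof_dates(sample))
```

Note that the time-zone suffix is left untouched here; per the point about time zones above, you could rewrite that as well, as long as the replacement stays the same length.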
And then finally, you could use various file hosting solutions, which are more or less friendly to the type of content that we want to share; some of these are the following. Okay, and that's pretty much it. Now we'll open it up to questions, but the last thing I want to say is: remember that this is serious business. This is why we started off with a formal disclaimer, because people are getting arrested for effectively simply sharing information. So, in other words, be safe and be careful when you do this, and remember we are at war. At the time that I'm speaking, Elsevier, one of the biggest publishers, has filed a John Doe lawsuit in New York against Sci-Hub. LibGen is also under attack; for instance, the High Court has recently blocked it in the United Kingdom. So these are very serious issues. I may have addressed some of them glibly as a way of getting them across, but keep in mind this is very serious business. Okay, thank you, and if you have any questions... Okay, thank you very much. We do have time for questions. I would ask everybody to please line up at the mics. We do have a question already. Please go ahead. I do have a question. Your injunction to steal books from the library is very strange. In particular, it violates most ethical positions, including the Golden Rule, and it ignores the fact that librarians are very protective of patron privacy, both on a historical basis and in individual cases. If you heard Brewster's talk a couple of days ago, he talked about the national security letter which the Internet Archive got and fought. Many libraries have a long tradition of resisting law enforcement demands for patron records. So I think you're deeply misguided in suggesting that people steal books from libraries. I'm sorry, I'm deeply what? Misguided. So that wasn't really a question, I suppose, but I will respond to it in kind.
Anyway, to address your first point about the fact that many librarians are protective of user privacy: librarians can be served with orders under which they're not allowed to state that they have received orders to turn over loan records. That's a fundamental fact of at least U.S. law. However, even if that were not a fact, putting trust in another entity increases your exposure: if you're trusting the librarian not to hand over the records, that bridge does not need to be there if you simply take the book in the first place. In other words, you are putting yourself needlessly at risk. And I'd point out that you can also read the book in the library but not borrow it, and create no record of the book. Or you can photograph it with your phone without any record of your being at the library, and not deprive other people of this shared library resource. Okay, let's back up and take your points one by one. Yes, you can read a book in the library, assuming that you have physical access to the library. What we are fighting for is making knowledge globally accessible to people who do not have the privilege of being in a particular building. Second, to address your point about photocopying or taking a photo with your phone: yes, you can do that, if you want very crappy, low-resolution images. If you didn't want to take the book from the library, a more prudent solution would be to use their fancy scanners. But going even further than that, you seem to be assuming that in the act of taking a book from the library, the book would otherwise not have been taken out. But what of traditional patrons? They take books out from the library too. The difference is that when we do it, we put the books online for anyone to see, and then we can bring them back to the library, as opposed to a general patron who takes a book out, reads it for themselves, a fundamentally selfish act, and then brings it back.
So I don't particularly see the problem, unless you're assuming that we won't put the books back when we take them, or that we won't put the books online, which are the two fundamental imperatives of this mission. Yes, in order to steal a book, you need physical access to the library. Instead of stealing the book, you could read the book, you could photocopy the book. The quality of the photocopy that you make is unlikely to be noticeably different for usability. Of course, you could steal the book, and I don't even know how to go into this. I'm sorry. I think we do have some more questions. Thank you anyway. Please go ahead over here. If you have a massive quantity of PDFs you want to change the metadata on, do you have any recommendations for tools to batch-process them? So the question is whether your intention is to actually modify the metadata or simply to wipe it. If you simply want to wipe it, that's a lot easier to do: you could batch-process using the Metadata Anonymisation Toolkit, which can batch-process and wipe out the data. If you actually want to go through and spoof the data in every single one, at this moment there are not yet, unfortunately, any tools available to automate that. They're being worked on vaguely in people's free time, but unfortunately I don't know of any to do mass-scale spoofing at this point. Thanks. Okay, thank you very much. The other side of the room. You mentioned two potential problems to solve, the thing with the green cupcake and the thing with the spacing. Have you seen either of those problems in the wild, or have you heard of them? Yeah, that's actually a very good question. So I've looked at, by now, something like 20 major publishers of articles; none of them use these systems presently, but the systems do exist. My assumption is that it's only a matter of time before they're widely adopted, but that's a very good question.
In the wild, like I said, out of the 20 or so publishers I've surveyed by now, I have not actually found them anywhere yet. So these are, at this point, only theoretical attacks and counterattacks, but I think it's good that we start thinking about them early, as long as they don't scare us into inaction but prompt us to action in removing them. Okay, thank you very much. Thank you. Are there any more questions? Please get up to the microphone if you can. Right here, yes, okay, please go ahead. It's a bit off the original purpose of the presentation, but in terms of complementing this data liberation strategy with a strategy that also embraces the authors of the publications that are to be liberated: I think a lot has been done historically, for quite a while, with the social convention of a pre-publication or pre-formatted copy. There is a law in the UK that requires researchers to do that, and I believe to some extent you can put it on your university website, and academia.edu and ResearchGate are, I guess, two platforms that are helping to get around it. I'm also just a bit concerned that maybe the guerrilla tactics on the one side are not necessarily going to win the favour of authors. It's different when it comes to journals; I mean, I don't think many people get as pissed off when journals are pirated as... We are very short on time. Do you think you can answer that? So just very briefly, to address your question: there are things like academia.edu and ResearchGate, which are, again, legalist modes of praxis where authors can voluntarily put content online, and we should absolutely use those. But my point today is that we shouldn't limit ourselves to simply those kinds of legalist modes of attack. In other words, we should certainly work with the authors if they're willing to join us, but we shouldn't restrict ourselves to their consent. Okay, great. So I think we have time for one very quick question, please.
I simply wanted to add to the previous speaker that I read a lot of academic literature, and US universities nowadays normally put things online on the open internet, whereas in Germany the problem is very pronounced, and publishers don't allow you to put anything on the internet if you publish it in a journal, so I think this problem should be addressed at a future conference. Thank you. Thank you. That was more of an annotation than a question, so I think we're good, right? Sorry? I think we're good. Yeah, I think we're good. And thank you very much, Storm Harding, for a great talk.