We're excited to have with us here Jon Penney, who is a recent Berkman fellow and now research affiliate. Jon is a lawyer. He has just completed his doctorate at the Oxford Internet Institute at the University of Oxford. He's also a research fellow at the Citizen Lab and the Munk School of Global Affairs at the University of Toronto. He's here today to tell us about his doctoral research, which explores the regulatory chilling effects that may take place online. One of the reasons we're especially excited is that his research relies in part on data that he found in the Lumen (formerly Chilling Effects) database, although it spans a variety of sources, which he's going to tell us about. I should also mention that Jon is affiliated with the Takedown Project, a research collective studying notice and takedown that has also made use of the Lumen database. But without further ado, I want to let Jon get to his stuff. Jon, thank you so much for being here.

Thank you, Adam. It's great to be back at the Berkman Center; always just thrilled to be back. And thank you all for coming today. I know it's such a lovely day outside. The sun is shining, people are smiling. What a perfect day to talk about government surveillance, right?

So, Adam's right: I'm going to be talking a little bit about my research at the Oxford Internet Institute. There's the title of the talk. I realized that in the description of the talk I intimated that I might cover the entire thesis. I think I would be torturing you to fit the entire thesis into this next hour, but I do hope to at least focus on two of the key case studies in the doctorate. The doctorate, as you know, focuses on this notion of chilling effects. Now, I'm guessing most of you have heard this term before and have a sense of what it means, unless you're a refrigeration professional and have an entirely different idea of it.
But chilling effects, essentially, is this idea that laws and certain government actions might have a kind of chilling or deterrent effect on certain activities. Now, of course, all laws might have a chilling or deterrent effect, but chilling effects theory is mostly concerned with legal activities: activities that are constitutionally protected, that are even desirable. And so the question is, how do certain laws or government actions, like surveillance, impact those kinds of legal, protected, and even desirable activities? That's really the focus of my doctorate.

Breaking it down, I have Schauer up here, one of the first legal philosophers to turn chilling effects into a comprehensive theory for understanding laws under First Amendment doctrine in the U.S. But surprisingly, within the law, and I'm a lawyer and have a legal background, I teach law, there's been a lot of skepticism amongst courts and lawyers and empirical legal scholars on this question. So I've got two quotes up here. One is from an older case, many decades ago, Laird v. Tatum, which involved military surveillance; it was a constitutional claim about chilling effects, and it was dismissed on standing grounds: not a cognizable injury. Another, more recent, quote is from litigation arising with respect to NSA surveillance online, Clapper v. Amnesty International, where chilling effects claims were also dismissed for being too speculative. So a lot of skepticism amongst lawyers and judges on this point, but also amongst researchers and scholars from a variety of fields. This is a quote from a piece by Kendrick looking at chilling effects, the theory, the idea, and she essentially concludes, looking at all the research that's been done, that it has a flimsy empirical basis.
Now, granted, there have been a few studies since her piece that I think have provided some additional pieces of the puzzle, and I hope some of my research will add to that puzzle so we have a better idea of chilling effects and their impact. Some of the questions that have been raised: whether they exist, the magnitude and persistence of chilling effects, what factors might influence them. These are all questions that remain unsubstantiated, that still need to be addressed with systematic research. One of the challenges, and why you have this dearth of research, is: how do you prove a negative? How do you prove self-censorship? How do you show that somebody would have said something or done something but for this law, or but for this surveillance? Thankfully, the internet and some other innovations and events, like our friend Edward Snowden and his revelations, provide us with a research opportunity; I'll get into that a little more in a moment. So does the work of people like Adam here at the Berkman Center on the Lumen database, which has been gathering DMCA notices, and which also provides an opportunity for research that you wouldn't have had at another time.

So here's the general outline of my thesis. What I try to do, in its structure, is triangulate this phenomenon of regulatory chilling effects online. I'm looking at state action; that's the Wikipedia case study, which I'll be talking about first. But I also want to look at a statute that has been criticized for potentially having chilling effects, and that is the DMCA, the Digital Millennium Copyright Act. Those are the two case studies I'm going to be looking at today. The first one has to do with NSA PRISM surveillance and its impact on Wikipedia traffic.
So, coming back to Snowden: the Snowden revelations in June 2013 really provided an opportunity to investigate the impact of government surveillance on people's activities online. You had widespread publicity in June 2013, and there's an interesting study, the report came out in 2014 but the actual survey was done, I believe, in late 2013, in which 87% of Americans surveyed had actually heard of this NSA program, PRISM. So pretty deep penetration in terms of knowledge about an important public fact. Another study, and it's great to have Alex Marthews here, is the one he co-authored with Catherine Tucker at MIT, done in 2014, which looked at Google search. It treated June 2013 as a sort of exogenous event, looked at Google search data after that time, and actually found a 5% decrease in certain sensitive or embarrassing Google searches, amongst other very interesting insights. Reading that study got me thinking that maybe Wikipedia might be another site that could have been impacted in similar ways. If you have a chance, go take a look at that study; it's a great study, and I rely heavily on it in my research design.

I was actually pursuing this research question long before this happened, but it's one of those things that happen when you're doing research, and it adds a whole new public dimension to what you're doing: in March of 2015, the Wikimedia Foundation and the ACLU brought a lawsuit based on certain constitutional claims against the US government, the National Security Agency, and the US Department of Justice, essentially asserting this kind of harm to Wikipedia users: the harm to Wikipedia and the hundreds of millions of people who visit its websites is clear, because pervasive surveillance has a chilling effect. What my case study essentially does is test that claim. So, a little bit about my methodology and design.
I treat June 2013 as a kind of intervention, and I look at Wikipedia article traffic before and after, using what's known as an interrupted time series design. I also add a comparator, a sort of quasi-control, to make it more robust, and I use segmented regression as my method of analysis. The period of study is right up here: the data runs from January 2012 to August 2014. And this is the other great thing about open data platforms: Wikipedia, the Wikimedia Foundation, offers data very openly about its services and platforms, and that kind of openness made this research possible. I hope to have a chance to ruminate on that at the end of my talk. But I basically constructed this data set based on Wikipedia traffic.

So which Wikipedia articles did I include in this study? Following the lead of Marthews and Tucker (2014) as a starting point: there's no sampling frame, no way of knowing what all the sensitive content on Wikipedia might be. So I went to this Department of Homeland Security document. There's not a whole lot of information out there about what it's used for, but it's likely used for monitoring certain keywords online. So DHS monitors these keywords; they have certain categories of different keywords. I think it's to monitor statements on different online platforms, to search for national security threats, that sort of thing. And they have a category of keywords relating to terrorism. There were a few media studies done about coverage of the Snowden revelations, and a lot of the news coverage framed it this way: this surveillance was done for national security, to track down and surveil potential terrorism threats. So terrorism was a part of the media narrative during the journalistic reporting of the Snowden revelations.
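The interrupted time series design with segmented regression mentioned above can be sketched in a few lines. The sketch below uses synthetic monthly view counts and invented coefficients, not the study's actual data; it just shows the standard setup: a baseline trend, a level change at the interruption, and a post-interruption slope change.

```python
import numpy as np

# Synthetic monthly article-view counts: 32 months, interruption after month 18
# (standing in for June 2013). All numbers here are invented for illustration.
rng = np.random.default_rng(0)
n_months, t0 = 32, 18
t = np.arange(1, n_months + 1, dtype=float)     # month index 1..32
post = (t > t0).astype(float)                   # 1 after the interruption
t_post = post * (t - t0)                        # months elapsed since interruption
views = (2_400_000 + 34_000 * t                 # baseline level and trend
         - 260_000 * post                       # level drop at the interruption
         - 78_000 * t_post                      # change in slope afterward
         + rng.normal(0, 40_000, n_months))     # noise

# Segmented regression by ordinary least squares: intercept, pre-trend,
# level change at the interruption, and post-interruption slope change.
X = np.column_stack([np.ones(n_months), t, post, t_post])
coef, *_ = np.linalg.lstsq(X, views, rcond=None)
intercept, b_trend, b_level, b_slope = coef
print(f"pre-trend {b_trend:+.0f}/mo, level change {b_level:+.0f}, "
      f"slope change {b_slope:+.0f}/mo")
```

Reading the fitted coefficients: a negative level change with an unchanged slope would mean a short-lived dip, while a negative slope change would mean the long-term trend itself reversed after the interruption.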
So what I did was take 48 Wikipedia articles, which corresponded with the 41 terrorism keywords in that DHS document. The assumption there is not that people have seen this DHS set of keywords; it's just a starting point for content that I want to track in this study. Now, 48 articles doesn't seem like a lot, but that actually represents over 81 million article views over the course of the 32 months. We're talking about a lot of people that might be captured by this data.

A little bit more, to make it more robust: I also used some Mechanical Turkers to do essentially a privacy evaluation of these terms. The aim here is just to ask whether these terms, or this content within the study, represent the kind of content that might give internet users cause for concern if they knew the government was monitoring their activities online. Let me give you an example of one of the questions I asked, and this tracks closely to what Alex and Catherine did with their study: if you knew the government was monitoring your online activities, how likely, on a scale of one to five, would you be to avoid content based on this keyword? The avoidance rating on average was 2.62, and if you go through the terms, some obviously raised greater privacy concerns than others. Wikipedia articles on Dirty Bomb, Car Bomb, Jihad, those had higher concerns; certain countries that we might associate with terrorist events raised fewer privacy concerns. But on average, what the survey shows is: yes, this was content that would give some internet users cause for concern. So there is a privacy factor here.

All right, so what were my findings when we put it together? Here's my first set of findings; I have a few sets, because I think there are insights that come out as we go through them. What you're seeing here is a graph of the model results.
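Stepping back for a moment, the avoidance-rating step just described is a simple aggregation of per-keyword survey responses. A toy sketch, where the keywords and one-to-five ratings are invented for illustration rather than taken from the actual survey:

```python
import pandas as pd

# Hypothetical 1-5 avoidance ratings from crowd workers for a few keywords
# (invented numbers, just to show the aggregation).
ratings = pd.DataFrame({
    "keyword": ["dirty bomb", "dirty bomb", "car bomb", "jihad", "somalia", "somalia"],
    "rating":  [4, 5, 4, 3, 1, 2],
})

per_keyword = ratings.groupby("keyword")["rating"].mean()  # mean rating per term
overall = ratings["rating"].mean()                         # study-wide average
print(per_keyword.sort_values(ascending=False))
print(f"overall avoidance rating: {overall:.2f}")
```

The per-keyword means let you rank terms by privacy sensitivity, which is how a "highest privacy score" subset like the 31 articles used later could be selected.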
On the bottom you have the 32 months in the study, and the line cutting down through the middle is the interrupting event, the June 2013 Snowden revelations. So what this is doing is looking at trends in the views of these articles before and after that date. The lines you see on either side are essentially trend lines for the data; they give you an indication of the trend based on the data. This was actually what I was expecting on the basis of a chilling effect hypothesis: if people are chilled or concerned about government surveillance, then maybe over the course of that June, as they learn about it, there would be a drop-off in views of this kind of content. But over time people realize: of course I'm not going to go to jail for viewing content on Wikipedia; why would I be concerned? And that's what you see on the other side: there's a drop-off, but then there's not much of a change in the trend, just a gradual, continuing month-to-month increase in article views of this content. So there is a statistically significant drop-off in the middle, over June, but viewership is still increasing over time.

But there's an issue with these results. Does anyone spot it? It's the two extreme outliers there at the top. You see an extreme outlier in November 2012 and then another one in July 2014. Can anyone tell me, and some people here know this because they know my research, what might have happened in those two months that might have caused a radical increase in the views of articles? Can anyone think of a common event during those two months? What were those months again? November 2012 and July 2014. All right, I'll end the puzzle. I call this the Hamas outlier, because in those two months you had two Israeli offensives in Gaza. In November 2012 you have Operation Pillar of Defense, an IDF operation.
The views for the Hamas article alone skyrocketed; you get a million views that month. Similarly, in July 2014 you get another Israeli offensive, Operation Protective Edge, another Gaza conflict, and you get another rapid increase. So what's happening here? What seems to be happening on these particular dates, and this is consistent with some other research, Brian Keegan has done some really interesting work on Wikipedia, he's affiliated with the Berkman Center, and there's other work on other social media platforms, is that a news media event is likely bringing in a population of new users. There's a media event; people come to Wikipedia to learn about it. Prior research shows that certain kinds of media events do have an impact on people's use of Wikipedia and Wikipedia content; people even go and edit after a news event. So maybe that's what's happening. But just keep in mind, and I'll just flip back, that in the second part of this graph, where you see it's still increasing, the trend continues gradually. So what I do is remove those extreme outliers, which provides us with a clearer picture of the actual trend in the data.

So here are my final results, and I'll have the graph up in a moment. What I do is focus on the 31 articles which had the highest privacy scores, the ones with the most privacy concern based on that survey, and I add a security-related comparator group. The difficulty here is that if you've got a control group that is really similar to your actual treatment group, that is, the terrorism-related articles, then it's going to be captured or impacted by the same stimulus: if it's privacy-concerning content, then people are going to stop viewing it thereafter. So I chose a security-related comparator group, essentially a group of Wikipedia articles that concern government agencies that deal with security-related matters.
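The outlier handling described a moment ago, flagging months whose views sit far outside the overall trend and removing them before refitting, might be sketched like this. The series is synthetic, with two artificial spikes placed in November 2012 and July 2014 to mimic the Gaza-conflict surges; none of the numbers are the study's.

```python
import numpy as np
import pandas as pd

# Toy monthly view totals over 32 months with two spike months standing in for
# the Nov 2012 and Jul 2014 news surges (all values invented for illustration).
months = pd.period_range("2012-01", periods=32, freq="M")
views = 2_500_000 + 30_000 * np.arange(32, dtype=float)
views[10] += 1_200_000   # "Pillar of Defense"-style news spike
views[30] += 1_400_000   # "Protective Edge"-style news spike
series = pd.Series(views, index=months)

# Detrend with a simple linear fit, then flag months whose residual is an
# extreme outlier by a robust z-score (median/MAD rather than mean/sd, so the
# spikes themselves don't mask their own detection).
t = np.arange(len(series))
trend = np.polyval(np.polyfit(t, series.values, 1), t)
resid = series.values - trend
mad = np.median(np.abs(resid - np.median(resid)))
robust_z = 0.6745 * (resid - np.median(resid)) / mad
outliers = series.index[np.abs(robust_z) > 3]
cleaned = series.drop(outliers)   # refit the interrupted time series on this
print("flagged outlier months:", list(outliers.astype(str)))
```

In practice you would also inspect regression diagnostics rather than rely on a single cutoff, which matches the point made later in the Q&A about checking influential data points before and after exclusion.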
There's going to be some overlap there, but it serves to show the difference between the trends in these two groups of data. What I find is that there is, again, a statistically significant, highly significant drop-off over the course of June, of 262,000 views, but this time the trend in the data changes. Before, you've got a gradual monthly increase of 34,000 views month to month; after, you have a complete shift in the long-term viewership trend: it's now decreasing by 44,000 views per month.

So here's what it looks like. It's the same graphing, the same interruption of the time series in the data set, but let me explain a little more of what you're seeing. The top lines, the darker lines, represent the group we're focused on: the 31 terrorism-related, most privacy-sensitive articles. As you can see, it's a fairly steep incline, so viewership is growing in the lead-up to June 2013. There's another drop-off over the course of that month, but the difference here is that the trend afterward is constantly declining, all the way to August 2014 at the end. By contrast, at the bottom you see my comparator group, the security-related articles, which have a gradual, almost constant level of viewership over the course of the study. There's a slight, not statistically significant drop-off in June 2013, and there's really no significant change in the trend; it just seems to continue on after June 2013. So you really see the contrast between the security-related articles and the privacy-concerning, terrorism-related content. So what are some implications of this?
If I'm right that this is evidence of a chilling effect, it's not just what I was expecting based on a chilling effect hypothesis, an immediate or sudden chill over the course of that month, but something I wasn't expecting, and something that contradicts some of the research out there on chilling effects: there might be a long-term chill. Over the long term, people aren't reasoning that no one goes to jail for viewing this; they're concerned with being flagged by government, being characterized, or being caught up in some government data sweep, something like that. It might also indicate a gradual spread of awareness, that is, as more people learn about the Snowden revelations, fewer people view this content. I think the better explanation is the first one, because of the high penetration of knowledge shown in that Pew survey: as of the end of June 2013, 87 percent of Americans already knew about it.

I think there are also some interesting insights as to the impact of war on chilling effects. The Gaza conflict seems to ameliorate the chill, if that's what's being represented here. It might just be: I'm concerned about government surveillance, but I'm really interested in this global event, so I'm going to view that content anyway. That's one explanation. Another, and this is consistent with other research on other platforms, is that Gaza is a unique kind of event that brings new users to platforms, people who are just interested in learning about and talking about Middle East conflicts and Gaza. So that might be another explanation for that data. I think it also has implications for legal standing and for constitutional litigation, not just Wikimedia's but others', and of course we can talk about this a little later when we get into Q&A.
There are also implications for surveillance, democracy, and access to knowledge, given that Wikipedia is such an important and popular tool for people's access to basic information. So that is one study; I'll try to tie the two together later. There are of course limits to this study. The data is limited. Going forward, I want to build on this study, use a more sophisticated comparator group, and collate more data sets to make the analysis more robust. I'll talk a little more about those limitations later, because before I'm done I do want to talk about the DMCA case study.

This study involved 500 Google blogs and 500 Twitter accounts that received DMCA notices; I'll explain in a moment what that means, for those who are less familiar with the DMCA copyright scheme. For today, just for the sake of brevity and time, I'll focus on the blogger side of it, the 500 blogs that received DMCA notices. The DMCA, the Digital Millennium Copyright Act, is a 1998 statute that aims to police or enforce copyright on the internet. Here's how it essentially works, with a very handy EFF graphic, and let me just explain this in the context of Google blogs, because every platform is going to be slightly different. Say that I'm somebody who's posted a video online and I own the copyright in it. I see that some blogger on Google blogs has taken my video and posted it, without my permission, onto their blog. I send a DMCA notice, and this is all automated, sent electronically online, to Google blogs, to inform them that one of their users, their bloggers, has posted without my permission the video that I posted online. Google then usually disables the content; in the context of Google blogs, they'll often take the blog post and put it into draft form. That gives the user an opportunity to edit the blog, add different content, something like that, and they'll of course inform the user that they've received this
notice. The user then has an opportunity to either repost the content or file what's known as a counter notice, which essentially says: no, I actually have a right to post this content, it's entirely legal, I have a legal defense like fair use. If I file a counter notice, then the content can be replaced within the next 10 to 14 days, and that might lead to a lawsuit.

This scheme has been described, and criticized, by volumes of legal scholarship as having a chilling effect on activities online. The idea is that there's a lot of legal content being captured by these notices, and I can tell you that there are literally millions of these notices being sent out every single day. Wendy Seltzer, one of the original Berkman fellows here, has described this, I think quite accurately, and at least I'm testing the claim and I think my results back her up: she calls this a chilling effects architecture. That is, it's a regulatory scheme that really favors a chilling effect on users, as against free speech and other kinds of activities online.

A little bit about my methodology and design. I've got a random sample of 500 Google blogs. My sampling frame is a little bit older, because I built this data set earlier in my doctorate; it was my first case study: January 2012 to July 2014. The sample was drawn from recipient blogs, that is, blogs that received a DMCA notice. I visited each and coded for a range of variables, including whether the content was online or offline, and whether the blog was suspended or locked (locked is more for Twitter, where you protect your Twitter account). It's easy to do because in the DMCA notice you've often got a URL to the content being targeted. I also looked at and coded for potential legal defenses, for example fair use, and also had follow-up questionnaires to get some more granular data, basically asking: why didn't you repost? What was the
reason? And so a blogger might say: well, it's because I didn't want any trouble with the law, even though I think I had a right to repost.

So what were my findings? To give you a sense of what was in the sample, these are the kinds of blogs that were actually targeted. Quite a broad range. The largest was actually a category of "other," blogs that were hard to categorize, a little unclear on their face. You get large categories like culture and business, and there's of course adult content and spam in there. I was expecting more of that in the sample, but on Google blogs you've got a lot of people blogging, so you actually get a lot of text, as you'll see in a moment, where people are being targeted. So a broad range of blogs being captured.

In terms of the content being targeted, also a broad range. Because blogs are a text-heavy medium, a lot of the content targeted by the notices in my sample was actually text-based: excerpts from other articles and news stories were targeted by DMCA notices. In the next largest categories you have images, video, and mixed content. So a broad range of content being targeted here, and I think if you look at other mediums, it's going to depend on whichever medium you're looking at.

So what was the actual impact of these notices? What you're seeing here is a graph of the 500 blogs that received notices: 88 percent of the content targeted by the notices, when visited, was offline or inaccessible; 12 percent was accessible and still online. So these notices are pretty significant: a lot of the content that's being targeted is now offline. Breaking it down a little further gives you some indication of what this content might be. You still have the 12 percent that's online, but this gives you
a breakdown of what's offline. In 43 percent of cases, the largest share of content offline, the blog is still there, the blog still exists, but that post is no longer there, or the content has been removed: the blog post is there but the image that was in it is gone, with a sort of "image removed" thumbnail. You've got another large percentage, 32 percent, where the blog is suspended. I think these are the pirates, because suspension suggests you've got a blogger that's violated the terms of service more than once; maybe multiple DMCA notices have arrived, so you've probably got a lot of pirates captured in that 32 percent. Then you've got another interesting category, 13 percent, where the user has deleted their blog or relocated it. That could be somebody who's a pirate and figures the gig is up, I've been found out now that I've received the DMCA notice. Or, more worryingly, it could be somebody who received this notice and is so frightened about being sued that they've just shut down their blog. That would represent a pretty significant kind of chilling effect: all future speech on that blog might be chilled.

Were there bloggers who made their blogs private, rather than deleting them or having them suspended? Right, so I categorized that as relocated, or locked, under the 13 percent: the blog is still there, it's locked. That was a smaller percentage on the blog side, more common on Twitter, where you had a lot of people who received DMCA notices and just made their Twitter feed protected after that date.

Okay, so let me break this down a little more. There are only a few more results slides and then we can get to Q&A; I want to leave at least half an hour to talk, because it's great to get you guys' thoughts on
this. What I have here is a cross-tabulation: the same breakdown you saw a moment ago, but cross-tabulated against the potential legality of the content. The first pie graph you see there is the content that's online, the targeted content that was still online at the time of the study. The second is the content that's offline. Of the content that's offline but where the blog is still there, 72 percent was unlikely to have any kind of legal defense. However, you still get 17 percent with a possible but unclear legal defense, like fair use, and then an even smaller but still, I think, substantial percentage, 11 percent, that likely has a legal defense, so the content is likely legal to some extent. Further down, in the user-deleted subcategory, again a large percentage, 75 percent, is likely copyright infringement, but still, in both the blog-suspended and the user-deleted categories, you've got a substantial percentage of content that likely or potentially could be legal in the circumstances.

So the concern here, following along with Wendy Seltzer's idea: is this a chilling effects architecture? It seems to be, right? Users receiving these DMCA notices are chilled, and they're not replacing the content. They're leaving the content offline even where, in a lot of cases, their content is likely legal, or there's at least a good argument that they have a legal defense. So there's a lot of legal content being captured by these notices.

Some additional findings. I found a modest inverse relationship between the strength of the fair use or legality claims and whether the content was offline. Put more simply, there was a statistically significant association between how likely the content was to be legal and whether it would be online. That was similar with Twitter as well. You might, from one angle of
view, conclude that this means the DMCA is actually working, in the sense that if the content is less likely to be legal, it's more likely to be offline, and if it's more likely to be legal, it's more likely to be online. That's one angle. But looking at it from another angle, you might say: that's true, but the scheme also seems to be capturing a lot of potentially legal content and expression online, and people are being chilled when receiving these notices. I also found no evidence of any counter notices, and this is consistent with findings in other studies that counter notices are very, very rare.

And finally, I thought this was really interesting. I don't know what to call the phenomenon; I call it Stockholm syndrome. Amongst the bloggers, there were only 15 instances where a blogger, on their blog, speaks of or mentions the DMCA or copyright (on the Twitter side, only 12 instances), and in the vast majority of cases they're either neutral in their tone about the DMCA or copyright or, in some cases, favorable about it. How do I explain it? Maybe, thinking of Sonia Katyal's work about copyright surveillance: if you receive a notice, you think somebody's watching you, so you may as well say, I'm going to behave here now, and I'm going to speak favorably about this, because I know someone's out there watching me. Hard to explain, but that was the finding. Maybe it's an example of a notice being not just a chill from receiving a personalized legal threat, but also the idea that there are people out there watching what you're blogging, leading to a kind of context collapse: people shifting their voice, their identities, their approach to certain things. I'm not sure, but that might be one explanation.

Of course, there are limits with this study as well. My data set was limited to 2012 and 2013. I had a low response rate to
the questionnaires; getting people to talk about this is very difficult. There are complications with the legalities, and there are a lot of assumptions built into my legal coding, because a lot of the blogs captured in this study were also international. I have some graphs, but I don't need to put them up now because I want to get to discussion. A lot of the blogs were international, and in a way you could argue that the DMCA has become almost an international copyright enforcement and policing statute for the world, through automation and through its implementation online. And the fact that this is really a study focused on Google blogs and on Twitter means that other platforms would likely yield different results, because every platform, and this is one way you might critique the DMCA, every platform is going to be dealing with these challenges in different ways.

Tying these things together, in terms of my overall implications, let me say a few final things. I think both case studies suggest both the existence and the potential persistence of regulatory chilling effects online, in two different concrete contexts. In one we're talking about government action and surveillance; in the other, a statute criticized for chilling effects that is policing legal norms online. One is government related, but the other is often enforced by private parties asserting their legal rights, and both can lead to very similar kinds of chilling effects. There are also the subtle conforming effects of these regulatory regimes: avoidance of certain content in the Wikipedia case, content left offline in the DMCA case, or the Stockholm syndrome that I mentioned. I also think there's no single overarching theory for understanding this phenomenon. In some cases it's going to be the threat of real legal action, penalty, and prohibition; that's the DMCA case, where these notices are personalized legal threats and people are concerned about real legal repercussions. But I think if my evidence
is right about the Wikipedia study, the explanation there has to be something less to do with concerns about real legal punishment and more to do with broader concerns. Daniel Solove talks about surveillance as a kind of environmental pollution: you don't want to be caught up, labeled, targeted, or categorized as a non-conformist or as a threat to the state. And finally, I think this provides a real window into the potential scale of the impact: a billion takedown requests per year under the DMCA regime, and in the Wikipedia case study we're talking about large numbers of people viewing this content. I just looked at one subset of content on one platform, even though it's a very popular one; you could maybe see this reproduced on other platforms and other sites. Alex and Catherine saw a similar result with Google search. So I think that gives a window into the potential scale. I'll leave it there and look forward to your questions or comments. Thank you.

So with the first study, the Wikipedia one, and I doubt this would really change the result, but I was just curious whether you looked into maybe normalizing some of the data against Wikipedia traffic in general.

Okay, so I realized I didn't have a chance to mention this, but my regression does control for all Wikipedia traffic to English Wikipedia, so it includes traffic across mobile, desktop, all platforms. That was part of one of the slides that died as I was trying to slim this down, but thank you for the question. So this does control for background Wikipedia trends, and there are some different trends: mobile traffic is constantly on the rise, and desktop traffic, depending on the year, is gradually decreasing. Neither can really explain what's going on with this content. Ryan, and then Alex.

This was all super interesting and hard work that you've been doing, so congratulations on doing this and on the attention that your work has been
getting. I was wondering about the trend lines. Obviously, in the Wikipedia data there's that big change in June 2013, but when you saw that the Hamas news stories were creating peaks and valleys on certain terms, I was wondering how much noise like that there was on all the other terms you were using, maybe not to such an effect. Like, a pipe bomb goes off in Cairo, so people search "pipe bomb," and that peaks for a few weeks and then goes down. How did you account for those general peaks and valleys, if at all, when thinking about whether, in the fourteen months after, there were simply fewer news stories about pipe bombs? Was there any attempt to match up search terms with other news stories?

Sure. So the key approach there is that you're looking for outliers, in the same way as with that Hamas article, right? You're looking for data points or elements in your data set that are having an outsized influence on your results, and it was quite clear with Hamas. I can say there were other outliers in the set. For example, the Palestine Liberation Organization was another article that also had an escalation in views during the same period, November 2012 to July 2014, but the difference wasn't enough for me to justify excluding it. And actually, if I switch back to the original graph, you can see some influence there, and I think even here you can see an influential data point just before June 2013; I didn't really investigate that. The key is looking at diagnostics for the model, and if you find something that really is having an outsized influence, investigate it, try to understand it, and provide, as I said before, results before and after. I
think the difference with my final results, compared to the initial results, is that overall there's a lot more noise. In the first set, when I focused on the 31 more privacy-sensitive terms, you can see how tight the data points are around the trends, both for the terrorism-related content and the more security-related content. There's a little bit of noise in various places, but the point is to make sure that in the expected, predicted values of your results there's no real outsized influence. There's maybe an outlier there, but it wasn't extreme enough to justify excluding. And also, thank you for the kind words at the beginning. Right, thanks. So, Alex, and then Nate.

First, I should echo that I think this is very interesting work, and it's really great that you've been trying to put it together; I think focusing on Wikipedia is an excellent idea. The main question I have relating to that study deals with the list of security-related terms. Perhaps you could talk a little about the choice to develop, as a comparator, a list of security-related terms rather than, say, a random sampling of Wikipedia pages on general topics. And secondly, could you go into a little more depth on how you chose the security-related Wikipedia pages?

Sure. I think this is one of the areas of the study where, moving forward, I'd like to have a somewhat more sophisticated approach to matching with the comparator. How I came to choose the security-related terms was based on two factors. One was avoiding bias: I want to avoid people accusing me of basically selecting certain articles to include as a comparator. I considered a random sample of articles, and here's why I think that wouldn't work, and this was the feedback that I got: if you just gather a group of random
Wikipedia articles, that's probably going to track certain trends. There are going to be subcategories of content on Wikipedia that different events are going to influence, right? So if I added a comparison here of, say, random Wikipedia articles, or Justin Bieber for example, if I showed his Wikipedia hits over the same period, it would give you some sense of some other trends. But I think the critique of using that as a comparator is that it's too different from the terrorism-related content, such that whatever trends you're seeing in my focus here, the terrorism-related content and my hypothesized chilling-effect impact, would be too different from what a randomized sample would show. So I guess you could say my choice was a sort of normative matching approach. And the challenge here, and I hinted at this when I discussed my methodology, is that I can't have a perfect control group, because it's not an experimental setting. Ideally you would have people who were exposed to the stimulus, here the surveillance, and you could isolate them from others who were also viewing Wikipedia articles on terrorism-related content; you can isolate it that way in a perfect experimental setting. (Yes, let's talk about that after, exactly.) Unfortunately, here everyone is potentially subject to the stimulus, and you saw the high penetration there, 87%. So you need to find a comparator group that is close enough to the terrorism-related content, but not so close that it's going to attract the stimulus such that it just looks like you have the same trends, and you can't tell whether there's a chilling effect or you're just picking up some background trend. I thought the security-related articles made sense, one, because again I'm avoiding bias; I'm just using the same set of keywords from the same document, but security-related. I
figured there would be overlap: people viewing terrorism-related content prior to June 2013, if they're interested in that, might also be viewing articles on the CIA and the Department of Homeland Security, right? But post June 2013, they wouldn't be as concerned about viewing those same pages in the way they would be about viewing "car bomb" or "dirty bomb." So there might be some overlap, but that's the challenge, unfortunately, with this kind of non-experimental, quasi-experimental design. That was the best justification I could come up with for the comparator group, but going forward I'd like to employ more sophisticated matching, propensity matching and that sort of thing, as I correlate other data sets, and it would be great to have that conversation with you, because I know you guys do a good job of it. The idea here was to get something as close as possible to that content, but not so close that it's going to attract the same impact, so you could see the difference. Nate.

I know we've talked about this before, and I really appreciate seeing this all unfolded; that's really nice. But I'm curious: you set up at the beginning actual opinions from judges saying this thing is not meaningful or substantial, and various other things. Does something like this ever have an impact on what happens in the courts, and what does that path look like?

Well, I think absolutely. One of the challenges often with litigation, and in particular constitutional litigation, is that with a lot of these constitutional claims with respect to surveillance, as you saw in that early case decades ago, Laird v. Tatum in 1972, there have often been concerns about how to prove these kinds of claims. Often they're described as subjective, speculative, not cognizable injuries. And so I
think built into those assumptions is some legal doctrine: a lot of these claims fall on standing grounds, the idea that you don't have standing to show you've been injured, to sue before the courts on the basis of a violation of your First or Fourth Amendment rights. And that's essentially among the grounds on which the Wikimedia Foundation, along with the ACLU, have launched their lawsuit; they actually lost at first instance in federal court, essentially on standing grounds, among others. So I'm hoping that having an empirical foundation for these kinds of claims, showing that chilling effects are not necessarily speculative, that they're not merely subjective, that there's actually an objective basis for this kind of concern, can certainly help the Wikimedia litigation, and I think it could help some of the other surveillance-related litigation out there as well. And I know that in other related litigation there are some great survey studies relating to chilling effects, done by some Berkman-affiliated researchers, at Pew for example, that have asked respondents important questions: did your knowledge of the surveillance revelations do anything, how has it impacted your behavior? I think all of that provides great insight, and it has been cited in briefs filed with courts before. I think they've been successful in some cases in persuading courts; it's a bit mixed, some have succeeded in proceeding and some have failed on standing grounds. I'm hoping that this study adds another angle, another data-based angle, to provide more empirical foundations for those claims.

Sure, so I'm hearing from you maybe two mechanisms. One is an impact through methodology, where someone thinks that they've
experienced the chilling effect on their platform, and they can do research like yours to show that they have standing: we did novel research, we found it, and we used Penney's methodology. Another is through citation, where someone says, I've experienced what Penney said happened on blogs and on Wikipedia, and we know it's a real thing because it happened to other people. And then there might be a third category, which is going to lawmakers and saying, you're about to consider new laws about this, and this is one outcome of other similar laws. Are those the three main ones?

I think you've actually said it much better than I could have for the next five minutes, so you've captured it perfectly. On the one hand there's the litigation context, some of which I've been thinking about because it's related to Wikipedia. There's the lawmaking context, which is the one you've mentioned, and I think it's really important. And yes, the third is the research: the great Google search research done by Alex and Catherine inspired my research, and hopefully our research inspires others to do similar research on other platforms, drawing on similar designs and similar forms of analysis. I think all three, and you've articulated them very well, are hopefully the way forward, but maybe the easiest one in the end is just getting lawmakers onside to reconsider some of these government policies. Oh, sorry, this gentleman here.

I was wondering if you looked at the source of the traffic in terms of countries, to see the composition of the traffic over the period.

If I looked at the source of the traffic, which countries it originated from? Oh, right, great question, and it's one of the limits of the data. The open data that the Wikimedia Foundation offers doesn't have geographically specific data, so I don't know who is
accessing from where. There is some data; I think a very high percentage of English Wikipedia traffic is actually American, maybe 90%, and I could have that wrong, but I think a very high percentage comes from the United States. Internally they have that data, and for privacy reasons they don't release it, understandably, but I think that would be another really interesting element to add to this, to understand the international impact, whether it's disparate, whether it's greater or not. Exactly.

On judging retrospectively whether the content should have been legal or not: did you have access somehow to what the blog looked like before a DMCA notice happened?

So, you flag one of the challenges with a study like this on the legal side of things. Often it was actually pretty apparent from what was captured. Sometimes you'd have a very tiny thumbnail that was targeted by the DMCA notice, and that was easily described in the notice. In terms of looking at potential fair use defenses, I used a very narrow approach. Some cases are very obvious: for example, content in a blog post, a tiny excerpt from an article, targeted by a DMCA notice; it's described easily in the notice, and it's no longer there, but it's quite obvious on the face of the DMCA notice what it actually was. That's right. And because I used a very conservative, restrictive approach for the percentages I had up there, if I was unsure whether there was a legal defense, the notice often just described exactly what the excerpt was; you could infer from the description in the DMCA notice, say, that something just copied an entire book chapter into a blog. That's right, that's right. And in many cases, for example on the Twitter side, it would be maybe
a link going to infringing content, which itself is not necessarily a copyright infringement, merely linking to something; but putting that aside, you'd have other content in that tweet which was basically removed. It was very easy, because you'd visit the tweet and it would say it's been withheld for copyright reasons, so there was a copyright claim, even if you've got other things attached to it. Yeah, at the back.

I wanted to ask if you had any thoughts about solutions for a chilling effect of this magnitude. Does it fall, do you think, to the ISPs or the OSPs in between, who are sending users these DMCA requests, or do you think it falls to government to try to regulate this a little?

I think it falls on a number of different shoulders. Just a few years ago there was a large public debate, with protests and activism, surrounding new statutes, SOPA and PIPA, the Stop Online Piracy Act and its counterpart, and basically the claim made there was that the DMCA was a failure, that it wasn't effective in policing online piracy and that we needed bigger, tougher, more stringent, more invasive forms of regulation to enforce copyright online. I think there's a bit of an empirical vacuum there: those kinds of debates should be informed by empirical findings and empirical studies. So on that side, it falls on legislators to look at the DMCA and ask, are there ways we can rein elements of this in so that it continues working, it polices copyright infringement online, but also has some better safeguards? What does that look like? Well, there are other kinds of online policing systems. In Canada, for example, my native home, there's a notice-and-notice system, which is actually effective in having a lot of targeted content removed, or sorry, it is actually very effective in preventing
infringement from recurring. There have been a number of studies done on it, or at least secondhand data provided by ISPs and OSPs, but at the same time it doesn't lead to immediate removal of the content. Notice-and-notice means a notice is sent to the OSP or the ISP and is then sent on to the user in question, but the content is not removed; it's up to the user to decide if they want to remove it or continue with their activities. I think that's maybe a better-balanced approach than the DMCA, and it's also effective in policing online. So there are other solutions out there that could be considered; I think it falls on legislators. I also think Google has been a leader in being transparent about how they approach this and in providing their notices to the Lumen database so that research like this can be done. And the sense I've had when I've dealt with companies like Google is that they don't like having to police this stuff, and they're looking for ways to do it better. So I think it is on companies to take that kind of attitude and approach and to be more transparent. If more companies provided more data to Adam, more research could be done, and I think that can help; Adam's like, no. So I think it falls on the companies, to an extent, to do their best in policing this so they can strike a balance, because really each platform is going to have a different approach, and I appreciate that in some cases it's going to be very resource-intensive to deal with this kind of problem. But it's a model that's growing; it's expanding to other kinds of legal norms, like the right to be forgotten in Europe, which also has a bit of a notice-and-takedown, or notice-and-removal-of-links, mechanism to it as Google has been implementing it. So I think the responsibility falls on researchers doing research on this, on lawmakers coming up with better-balanced approaches, and also on companies doing their best
to acknowledge that there are legitimate, competing interests on both sides of these questions, and to do their best to implement safeguards while at the same time policing copyright-infringing activities online. Yourself, and then, sorry, this gentleman here; he's had his hand up for a bit, and then I'll come back.

This might actually be a slight follow-up to Nate's question. I was just curious, with the litigation from Wikipedia, whether they had also made economic claims, to the extent that, I mean, they're a non-profit, but people donate, and so maybe those lost views cause them to lose money. Was that part of it?

So in that particular litigation there's a range of different organizations involved, and I can't recall if they made specific economic-harm arguments, but I think it's easy to infer, based on the claims that were made, that it was causing harms for their foreign and domestic readers, and having an impact on Wikipedia editing and other kinds of activities. So I think it's easy to infer from that that it's going to have an impact on the bottom line of companies that might be affected. If we're not talking about Wikimedia or a platform like Wikipedia but about a for-profit platform, it's likely going to have an impact. But even with Wikipedia and the Wikimedia Foundation, you're right: they need benefactors and they need users, and if usage is suffering due to these kinds of online activities, that's going to impact their user base and their readership, and likely their viability as a service over the long term. And it would be really sad to lose a great service like Wikipedia.

With notice-and-notice, would the user be subject to further legal action if they don't take down the offending or objectionable material voluntarily?

So with the notice-and-notice system, it's called notice-and-notice, a very awkward name, because two notices are sent before
the copyright claimant can go to court. So a notice is sent to, let's say you've got a telecommunications company, and I send a notice to a telco, or I send it to Google, and that notice is then passed on. Nothing's removed, yes. Subject to further legal action? That's right, that's right. So no content is removed; a notice is delivered, and if the activities continue, a second notice is delivered. Once two notices have been delivered and a bit of a time period has passed, the copyright holder can go to court, sue, and obtain user records from the OSP. So you have to send two notices, and there's basically a waiting period. The idea behind it is that there's an educational period; maybe the person doesn't know what they're doing. Yeah, exactly; even if it's inadvertent, you still get a chance. So that's an alternative model which has seen some success, but it's also one that's been criticized as not being tough enough on online piracy. I don't think so. Can I ask you one more? Sure.

With regard to this research, can you apply this to data streams that are sent overseas, to either American nationals in foreign countries or to people overseas in Europe, as supporting evidence that there's a chilling effect? Could you prove harm in a court somewhere, could you get standing?

Using this particular research? I mean, this will help, right? You're going to use it as contributory evidence. Every legal jurisdiction is going to have different standards for what's a justiciable question, what you need for standing and to prove harm, but I think every little bit of evidence helps, whether it's the Ninth Circuit or overseas in Brussels or something. Right. And at the very least, if not in litigation, then as Nate pointed out earlier, I think it has an impact on privacy treaties and privacy agreements. There are different
arrangements being negotiated now. Previously we had a Safe Harbor agreement governing the EU privacy directives and how European citizens' data is handled by American and Canadian companies; a lot of this is being renegotiated, and a new General Data Protection Regulation on privacy has just been released. All of that's going to have an impact, and I think having a little bit of evidence of the impact of these kinds of activities will influence that process as well, whether it's in courts or amongst governments negotiating these types of agreements. So I think that's the aim, and maybe they're watching at home today; we'll see where this goes.

With the rise of automated notice systems, for example the Content ID system for YouTube, did you differentiate between automated notices and manual notices, and did you find a difference with regard to the chilling effect, or with regard to accuracy, whether content was actually infringing?

I didn't, but that would be a great thing to do in a future study, and I'm not sure, does the Lumen database actually distinguish that in the notices?

Not officially; we have the opportunity to tag notices, so that could be possible. If you're looking at a specific notice, it's usually very easy to discern whether it's an automatic notice or not, just because the automatic notices are pared down to the bare legal essentials, and the more crafted ones tend to have come from a human being. The systems you're describing are usually both sent by and received by automatic systems, and then they get passed along to the user in that form. Content ID is actually automatic but not a DMCA system; it exists in parallel. Yeah, it's what I would describe as a private ordering system
which has many parallels to, but is lateral to, the DMCA, and relies in part on it. But the idea of looking into whether people are more or less responsive to receiving an automatic versus a personal notice is an intriguing one. Yeah.

The closest thing I have to what you're asking is one way you might be able to extrapolate automated notices in the sample. Looking at a DMCA notice, if there's just one URL being targeted, it's often a person, right? So what I used as a sort of proxy for robotized or automated DMCA notices is where you had ten or more URLs in there; you've got a lot of URLs being targeted, and you have notices now with something like a thousand URLs in a single notice. If you look at the percentages, in my sample 54% had merely one URL, 16% had two to nine URLs, and then maybe that remaining 30% is capturing more automated notices being sent, targeting a lot of different material for takedown. What I will say is that I'm working off data up to 2013; I think today you're going to see an even higher percentage, and moving forward it's all going to be automated. But I think it's a great question: is there a difference between companies that use an automated response to this and companies doing it by hand, on a notice-by-notice basis? I think that's a great question for future research. Sure, one more, yeah, sure.

Did you find in your research, when you were gathering the data, demographic data on who was accessing or searching for the Wikipedia articles, and whether the chilling effect affected certain groups more than others, perhaps pointing to an inequality among those who feel they have to self-censor more than others?

Right, so again, I think this
is a great research question, but I would be unable to do it with this data set. Even at the granular level, the data doesn't have geographical indications, data points on the location of different users and views, view counts by particular geography or country or state; I wouldn't have that level of granularity. I think you could do that, but it's going to be more of a qualitative study: you get a sort of subsample, and I think there's an opening there to do that kind of research, because there's a real need for it, to take a subcategory of users that might have been impacted, do a deep dive, do questionnaires, do long-form interviews, just to understand what impact this might have had. That's how you can bring in more demographic information; it's going to be harder with this kind of high-level data set. Another way would be to do a survey, gather your own users, and then gather data about them through the survey. My third case study in my doctorate is actually a survey which compares two of what I call chilling-effect scenarios. I won't get into that; it's a whole other Pandora's box, if you will. But in that one I do have some demographic information on the internet users captured in my survey, which provides some insight. It would have been great to have that with the Wikipedia study, but I didn't have that data; I think it would be great to do that research in the future.

Thank you so much for coming out.

Thank you, and thank you everybody else for a great discussion and some research questions I've got to get on. Yeah, thanks, guys.
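The quasi-experimental design discussed throughout the Q&A, a segmented (interrupted) time-series regression on article views with an intervention point at June 2013 and a control for background Wikipedia traffic, can be sketched in a few lines. Everything below is an illustrative assumption: the variable names, the synthetic data, and the coefficient values are invented for the sketch and are not Penney's actual data or model specification.

```python
import numpy as np

# Sketch of a segmented time-series regression: monthly views of sensitive
# articles, a dummy for the June 2013 surveillance revelations, a
# post-intervention trend change, and a control for total background traffic.
# All numbers are synthetic; this is not the study's actual data or model.

rng = np.random.default_rng(0)
n = 32                                   # months (hypothetically Jan 2012 - Aug 2014)
t = np.arange(n)
post = (t >= 17).astype(float)           # 1 from the intervention month onward
background = 1000 + 5 * t + rng.normal(0, 10, n)   # total site traffic (control)

# Synthetic outcome with a level drop of 80 views at the intervention
views = 500 + 2 * t + 0.1 * background - 80 * post + rng.normal(0, 5, n)

# Design matrix: intercept, pre-trend, level change, trend change, traffic control
X = np.column_stack([np.ones(n), t, post, post * (t - 17), background])
beta, *_ = np.linalg.lstsq(X, views, rcond=None)

print(f"estimated level change at intervention: {beta[2]:.1f}")
```

The coefficient on the `post` dummy estimates the immediate level change in views at the intervention, net of the background trend, which is roughly the quantity at issue in the "can background traffic explain it" exchange above; outlier handling (the Hamas article discussion) would correspond to inspecting influence diagnostics on this fit before and after excluding suspect points.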