There we go. Okay, everyone. Plus, the bells are tolling. That's right, I'm sure that's significant: some very momentous opening to what will certainly be a titanic conversation.

Well, thanks to all of you for showing up, either in person or on Zoom. It is my pleasure to have this conversation here with Ethan Zuckerman from UMass Amherst. I'm sure a lot of folks in the room are already familiar with Ethan's great work on, basically, the impact of social media on society and also, from the scientific perspective, how we can study social media, particularly as the social media companies are erecting more and more barriers to the free and open study of these technologies that have such a big impact on all of our lives.

So welcome, everyone. In case folks don't know me, I'm James. I help to co-direct the Institute for Rebooting Social Media, and one of the reasons we exist is to bring in great scholars like Ethan to chat about the most important issues that are affecting the space. So I think maybe Ethan and I will just start out with a little bit of intro conversation, but we want to make it interactive. If you have questions, feel free to raise your hand; we have mic runners here. If you're on Zoom, you can post something in the chat and that'll get reflected back up to us.

So, great, let's begin with a little bit of scene setting. As you may have heard, social media is not in the best state right now.
Hopefully that's not a contentious proposal. As you may know, it's becoming more and more difficult for an external researcher to study these platforms. In particular, API access has become this big battleground in terms of what external people can and can't measure and look at when they're trying to understand how these networks operate. So can you maybe talk about your history of interacting with those APIs, and how that has changed over the last couple of months or years?

Yeah, so let's start by talking about the current moment in social, and then I want to jump way back to when Berkman used to be my home base, because that's when one of these projects started. We are, I think, a little bit more than a year into Elon Musk's reign over at Twitter. It's been a very interesting year. It has borne out something that I think a lot of scholars in the space had been warning about for a long time, which is that having social media companies controlled by powerful for-profit corporations is probably not the best thing for social media as a public space. You could end up with decisions to make social media simply very eye-catching as a way of selling ads. You could also end up with social media essentially captured by someone's ideology, and we have pretty good evidence that social media has moved in that direction under Musk.

What we've had is a bit of a flourishing of alternatives, people saying, "Turns out having an open public space is really important. Twitter was actually very useful." And so we've seen everything from people trying to return to Mastodon and build the decentralized, more open social future. We've seen Jack Dorsey come in and try to get Bluesky off the ground with "Wouldn't it be nice if we could just go back to Twitter with fewer Nazis?" And then we had the very strange phenomenon of Mark Zuckerberg,
I think in a wonderfully sinister fashion, saying, "Hey, I bet we could repackage Instagram," and, lucky for him, managing to get a hundred million people to sign up for this in the first ten days. People were essentially saying, "Well, it's still owned by a capricious billionaire, but at least it's a less awful capricious billionaire."

By the way, current active users on Threads: negative eight. Please continue.

Negative. Fair enough. So while that's all going on, here's some other stuff that's happening sub rosa. Twitter decides that it's going to turn its API into a moneymaker, and so it decides to charge $420,000 a month for less than what was the lowest tier of paid access, and significantly less than what academics were able to get as free access. By the way, that's a 420 joke. Ha ha ha. He's such a funny guy. It's all about marijuana. But what that's meant is that anyone who was using the API for serious research can no longer do so. Some of the licensing terms are particularly egregious: they're demanding that we delete past Twitter data that we ended up having, and by the way...

Can you talk a little bit about those APIs with respect to scraping? Because something people sometimes bring up is, "Oh, well, there are these profiles that are kind of public." Yeah. "Why couldn't we just lean on that?"

Give me a second to get there. So Musk decides he's going to monetize the API. Steve Huffman, never at a loss to be as horrible as possible, decides he's going to follow in his footsteps. Reddit suddenly no longer has an open API, which means that tools like Pushshift, which did a great job of scraping and archiving that data, no longer work. You're right, Musk is a uniquely problematic individual on so many different levels.
There was a tool called snscrape, which was widely used to scrape public data off of Twitter. Musk's team blocks that tool. They block multiple accesses from IP addresses. It's very hard at this point to scrape Twitter. You could do it if you managed to have thousands of IP addresses at your disposal. There are projects that have done things like internet circumvention that are very good at pooling IPs, and you could proxy through those. But it does really beg the question of whether we need a technological solution to this or something else.

Other things: while this is going on, these new platforms coming online usually have not thought about research APIs to the same extent that Twitter had. So it's possible to study Bluesky, it's possible to study Mastodon, but it's not necessarily easy. Threads, with its negative eight users, still doesn't have an API that we can study and work from. And some of the existing tools out there, like CrowdTangle, which was at its peak a really brilliant tool for studying Facebook and Instagram, have been systematically downgraded over time. It is not as useful a tool as it was; most of the original team is no longer involved with it.

Let me throw in one last thing. The other thing that has really happened in the last three or four years is the rise of very small social networks that use end-to-end encryption: the importance of WhatsApp and WhatsApp groups as social spaces, the importance of Telegram, Discord channels, and so on and so forth. The problem with studying those spaces isn't just technological, it's ethical. That's a conversation designed for a specific number of people, limited to people within that group. Even if you can technologically find a way to study it, arguably you shouldn't.

But what all of this means is that we know a lot about what happened in the 2016 election, thanks in no small part to Yochai Benkler and his book Network Propaganda, which used a set of tools that I started
developing here at Berkman around 2005 called Media Cloud, which was basically unpermissioned, which is to say we never went to the newspapers and said, "Hey, we're going to archive you and make it possible to study you." We simply did it, because that was public information. And Yochai's work in 2016 used a great deal of unpermissioned information to make a case for a particular way that information moved from the far-right blogosphere into Breitbart, into Fox News, and then set the agenda for the presidential election.

In 2020, much of what we know about that election comes from permissioned research. There's a wonderful set of studies, really fantastic studies, put together by a big team of academics. You're part of it. Yes. They are using data coming out of Facebook with Meta's cooperation, which is incredibly important for certain types of research, including experiments where people are changing how Facebook or Instagram works and being able to compare one to the other. And we know a lot about 2020. It's not clear
we're going to know very much at all about 2024. We're losing our unpermissioned methods because we're losing the APIs and other tools. It's not clear whether Meta and others are going to make the same corpora available for 2024. And I don't think we're nearly worried enough.

That's a great summary, and there are a lot of challenges that were talked about there. One interesting challenge that you talked about in the middle is this idea that even if there is publicly available data, and even if in theory accessing that data is allowed by EULAs or stuff like that, there are still technical mechanisms that the platforms can use, like IP blocklisting, for example, to prevent that scraping from taking place. I think that's a technical nuance that a lot of people don't see sometimes, because access to a large number of IP addresses, which essentially is what you would need to get this to work, is actually difficult to pull off for a lot of researchers in academia, who essentially don't understand how to circumvent what, in essence, are these censorship mechanisms. And so a lot of your research depends on having some type of portal into seeing how these things work. Do you want to talk about some of the newer projects that you've been looking at in this space?

Yeah, let me talk about two things, and maybe we can use them as a way to talk about permissioned and unpermissioned research. So let me set a stage and then try to explain where these things were coming from. Since 2016, there's been a huge interest in mis- and disinformation. I think that work's really important, but I also think we often focus only on the surface of the problem. Let me explain what I mean by that. There's a lot of research out there, including research
that's very good at grabbing headlines, that says, "I looked for a bad thing on the internet and I found it." So it goes and looks for, say, information about ivermectin being shared on Twitter. And this is easy research to do: you go in, you search for ivermectin, or you search for whatever you're going to find, and in many cases you can come back and say, "I found N thousand posts reaching M million users." These are great big numbers and they look really good in headlines. The problem is they're a numerator without a denominator. So for instance, Avaaz did some work on Facebook, basically saying, "Look at how much vaccine misinformation is coming out. These posts got a billion impressions over the last year." James, is a billion a lot or a little?

Well, I guess this speaks to your question: what's the denominator here?

Well, the number on Facebook is more than a billion monthly active users, so the idea that a Facebook user would encounter a piece of misinformation over the course of a year does not strike me as particularly surprising or stunning. The other thing about this, by the way: it's not just the proportionality, it's the concentration, right? If you are in the "Ivermectin R Us" Facebook group, you're probably going to see a whole lot of mis- and disinformation, in part because you're choosing it. If you are a random Facebook user, your odds of encountering this are a lot more slender.

But here's a bit of a hot take, though. An argument that frequently comes up is that, oh, a billion sounds like a lot, but really, in the whole grand scheme of things, there's like a billion billion impressions every day, let's say, so fractionally it's pretty small. But there's an argument that says, well, back in the day, and by "back in the day," this is like computer time.
So, you know, it's like 15 or 20 years ago. So back in the olden times, when things were in black and white, the likelihood that you would encounter one of these views at all was smaller. This is sort of the person-with-the-sandwich-board argument: you were less likely to encounter these things just in your general day-to-day activities. And by this argument, people would say, well, big things start small, and so the way that people get radicalized, or things like that, it's a percentages argument. So what do you think about that?

I mean, I think it's an argument that only makes sense if your history of media only goes back 15 or 20 years. In the late 1930s, Father Coughlin had the most popular radio show in the United States, and he was actively being funded by the Nazis and was actively putting out propaganda, essentially participating in a replacement theory. There has been extremism in media even at the peak of the broadcast era. You have the John Birch Society having enormous amounts of influence for people who are choosing to have it. If you look at discourse around this, you can find wonderful pieces like Richard Hofstadter's The Paranoid Style in American Politics showing up at the peak of the broadcast era. So there's always been a conspiratorial voice. It's often a nationalist, white supremacist voice. Yeah, the internet makes it much, much easier to find it and encounter it. But I think in some ways the problem is that we're very good at counting it right now, and because these spaces are so big, it's very easy to come up with big numbers and not actually think about either the proportionality or who's actually looking for it.
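To make the proportionality point concrete, here is a minimal sketch of the arithmetic, using only the round figures mentioned in the conversation (about a billion impressions in a year, more than a billion monthly active users). The function name and the choice of exactly one billion users are assumptions for illustration; with more users, the ratio only gets smaller.

```python
def impressions_per_user(impressions: float, users: float) -> float:
    """Average impressions per user over the period both numbers cover."""
    if users <= 0:
        raise ValueError("users must be positive")
    return impressions / users


# Round figures from the conversation. Treating "more than a billion"
# users as exactly one billion is an assumption that overstates the ratio.
YEARLY_IMPRESSIONS = 1_000_000_000
ASSUMED_USERS = 1_000_000_000

if __name__ == "__main__":
    per_user = impressions_per_user(YEARLY_IMPRESSIONS, ASSUMED_USERS)
    print(f"at most about {per_user:.1f} impression per user per year")
```

An average like this says nothing about concentration: the same total could come from a small group seeing the material constantly while most users never see it at all.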
I find Brendan Nyhan's work, and some of the work that he's been doing up at Dartmouth, particularly helpful, where he has periodically found ways to piggyback, through data donation, on what people are encountering on YouTube. And what he basically finds is that, yeah, about 10 percent of the YouTube users he found bumped into what you might call extreme material. What's really interesting is that about two-thirds of those folks immediately backed out and were like, "No, thank you." Roughly three percent of his users were like, "Yeah, give me more of that." But what's interesting is he'd pre-surveyed his users, and those tended to be people with very high levels of racial animus and, by the way, gender animus, because the racists turn out to be sexist most of the time. So, you know, racists gotta racist. There are people generating those page views who are clearly using these tools to look for that stuff. You can trace back through American media history and find those folks over time. I think we're better at counting them now, and I think we're better at trying to get headlines for them.

But all of this led me to this idea that I wanted to study the whole thing. I wanted to study the whole platform. So my lab over at UMass has been working on a bunch of different tools that allow us to think about how we might study the whole thing. Let me show you two of them really quickly, which have some strengths and weaknesses associated with them.

So this is redditmap.social, and you've probably seen some projects in the past to map Reddit. This is a pair of visualizations. One is the standard social media hairball that we're all real good at generating these days, and this is done by basically saying: of the top 10,000 subreddits on Reddit, which ones have the most commenters in common?
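The "commenters in common" idea can be sketched in a few lines. This is a minimal illustration, not the lab's actual pipeline: the function name, the toy data, and the use of a raw shared-commenter count (rather than some normalized similarity) are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def shared_commenter_counts(comments):
    """Count distinct commenters that each pair of subreddits has in common.

    `comments` is an iterable of (user, subreddit) pairs.
    """
    subs_by_user = defaultdict(set)
    for user, subreddit in comments:
        subs_by_user[user].add(subreddit)

    overlap = defaultdict(int)
    for subs in subs_by_user.values():
        # Every pair of subreddits this user commented in shares this user.
        for a, b in combinations(sorted(subs), 2):
            overlap[(a, b)] += 1
    return dict(overlap)


# Toy data, invented for illustration only.
sample = [
    ("alice", "boston"), ("alice", "harvard"),
    ("bob", "boston"), ("bob", "harvard"), ("bob", "teslamotors"),
    ("carol", "teslamotors"), ("carol", "uberdrivers"),
    ("alice", "boston"),  # repeat comments by one user count once
]

pairs = shared_commenter_counts(sample)
closest = max(pairs, key=pairs.get)  # the pair to draw closest together
```

A layout algorithm would then place pairs with high counts near each other; at the scale of 10,000 subreddits, the pair counts would likely need normalizing so that very large communities do not dominate.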
So it turns out that if you comment on r/boston, you probably also comment on r/harvard. We put the two of them closer together, and we pair them in the space of the hairball. Now the problem is the hairball is really tough to navigate. So we decided to do something interesting. We took a much more powerful technique than all of these various graphing techniques: we used something called a librarian, and we asked her to come up with a hierarchy that humans could navigate. And so we're now able to do things like say, well, if you're looking at the finance space on Reddit, what's actually going on there? It turns out that there's hustle culture, there's stock trading, there's crypto mining, and you can go down and end up in some really interesting places. You end up with overlaps between people who are excited about Teslas and people who are Uber drivers and Lyft drivers, who are sort of hustling around those things.

The reason we did this was that we wanted to try to put certain phenomena in context. Reddit usually comes into the public eye for one of two reasons. Either it's things like stock trading or cryptocurrency or the wilder corners of the finance markets, and part of what we were able to demonstrate is that it's a pretty small part of what the site's about. Actually, what most of the site is about is things like local information, local communities, people's hobbies, whether they're online or offline, and things like really tremendous online mental health communities that often are really high functioning and really help people through things like new parenting and things along those lines. The second thing is that people love to think about Reddit and porn and drugs, and porn is a big deal.
It's a big category. It's not as big as you might think; it's actually much smaller than some of the rest of them. So our goal here was to try to deal with issues of magnitude, cutting across Reddit as a whole.

A couple of things I just want to mention here really quickly. This whole visualization, the whole data processing, was done by Jasmine Mangat, who did it as an undergrad at UMass, and Virginia Partridge, who did it as a data science master's student. People are very fond of asking me, "What's it been like moving from MIT to UMass?" The answer is the students are better and they're more fun to work with. We've got a chance for amazing, amazing stuff coming out of our lab at UMass, and I feel great about it.

This data all comes off of Pushshift, which is kind of a classic unpermissioned project. Jason Baumgartner started Pushshift many years ago to archive what was going on on Reddit. He has had a number of fights with Reddit about the privacy implications of this, and so on and so forth, and managed to maintain this up until the summer, when Reddit decided that it wasn't going to make the API open anymore. So, a couple of interesting things about this. This is designed so we can watch how this changes from month to month. But we can only go back in time at this point.
We can't go forward unless we start the process of scraping Reddit ourselves, and we have reason to believe that we'll be shut down when we do it. We've also tried to use this for fun stuff, like looking at protest and being able to look at how many of these subreddits went offline in protest of the change of Reddit politics. So this is, I think, a supremely useful tool. It's still a useful tool in exploring Reddit, but it is no longer an active and live tool, by virtue of the fact that Reddit has essentially said that it cannot be an active and live tool. Through CITR, the Coalition for Independent Tech Research, which you and I are both involved with, we've been in touch with Reddit. We've been asking for rights to keep doing this. We have no bargaining power. There's no position from which we can bargain.

And so, by the way, the Coalition for Independent Tech Research is basically a group that is focused on trying to allow people who are outside of companies, outside of these platforms, to actually do responsible, transparent research on those companies. This group was formed in part because of these types of concerns, where we've got these platforms that are increasingly intertwined with important social decisions, and yet increasingly there's not an easy way to get data like this. And we hear a bunch of different complaints from these companies about why it's infeasible to support APIs like this. One theory that a lot of people have is that these companies don't want their data to be scraped by generative AI companies, right? They view that user-contributed data as, in some sense, their data, which shouldn't just be used to make money for other companies. And companies do need to make money in some sense; they have to feed their employees. Can you talk a little bit about how you see that tension
maybe being resolved in a way that both allows these companies to make money but at the same time serves society in being able to understand?

Yeah, so let's talk about that, and then I want to show a second tool that's doing unpermissioned work, at least for the moment. So yeah, Steve Huffman's logic behind this was that the easiest way to train a large language model these days is to download a snapshot of Pushshift, feed it into something like Llama from Meta, and train your own model from the ground up. In fairness, there are a lot of really big text corpora out there, like C4, the Colossal Clean Crawled Corpus. It's not clear to me how much work is actually being done by grabbing things like Pushshift and analyzing them in that fashion. It's a really nice way to try to make a claim to your shareholders of "Look how valuable this is. Look, we've got the raw material that all the AI people want. Obviously, we're worth a lot if we decide to go public." But it would be really easy to essentially say, "Don't use our corpus for that, and if we find that you're using our corpus for that, we're going to take legal action against you," without just shutting down the corpus in general.

APIs have liabilities for corporations. So Reddit has at least three reasons why it doesn't like the API, right? It's got the stated reason, which is that people are going to use it to train AIs. Here's an unstated reason: Reddit's tools for accessing the site are terrible. Reddit's client sucks. It's miserable. They've never had good search. Nobody likes using it. People used third-party clients; I used one called BaconReader for many, many years. Those clients sold their own ads. They made it harder for Reddit to sell ads, and shutting off that whole ecosystem is a way to try to make significantly more money. The third thing is that you can use these tools to do research
that's embarrassing for these platforms. You know, we can end up somewhere in all of this; I would have to find my way around it. Yes, here we go: weed, CBD. They're probably not thrilled about the fact that it's really easy to find many of these drug spaces on Reddit. On the one hand, they enjoy having the traffic associated with them. But you can imagine having a lot of fun going through the pages of drugs on Reddit and talking about this. I will say, by the way, my favorite discussion in studying the drug space, and we had to figure this out because we were doing really a librarian-type hierarchy on this. By the way, we didn't do a hierarchy for porn. We felt like that was maybe more than research really demanded. But we did do a hierarchy for things like the drugs. We found out that there's an amazing split in the community between meth and "meth without socialism." So there is actually a less woke pro-meth community on Reddit. The pro-meth community was evidently too liberal for some of the meth users.

So "meth without socialism." Asking for a friend, I'm sure.

Yeah. But you can find this by exploring through this tool, and you can imagine why that might be a PR problem for something like Reddit, once you start mapping this out and start seeing how large the drug space is there. So if you view an API as essentially only downside, right, it's going to open us up to AI looking at our site, it's going to open us up to third-party clients interacting with our site, and it's going to open us up to research that makes us look bad, those become really good reasons to think about closing it down. What are the upsides?
Well, one of the upsides is that people might actually know what your platform is used for and not used for. You might actually have the research community giving accurate information on what your site is and what your site isn't. That seems to be a much harder sell to make to these platforms.

But, telling the story about unpermissioned research: we have a tool that we have not officially released, but I will share it with all of you in the room and all of you online. This is in support of a paper that should be coming out this week from the Journal of Quantitative Description. We decided that we wanted to understand what's actually on YouTube, and that sounds like a really strange question when you put it that way. I mean, obviously we know what's on YouTube. We all use YouTube almost every day. It's by far the most active site, other than a search engine, on the internet. The truth is we know an enormous amount about the top 0.01 percent of YouTube in terms of popularity. What we do not actually know very much about is what's on YouTube as a whole. And part of this is that YouTube gives you somewhat limited ways of exploring it. YouTube has a really good API. It's actually a terrific API. You know, 8 out of 10; it does a lot of what it needs to do. What it does not give you is a way to get a random sample of videos.

So here's a question for the crowd, and you could probably figure this out by looking at the graphs. What's the average number... sorry, what's the median number of views for a YouTube video across all of YouTube? What do we think the median number of views ends up being? Just give me a rough number. You are interpreting the graph correctly.
So this is what happens when I already put this up. You are almost... yeah, it's 39. So the truth is, if you pull up YouTube right now and look at all the videos featured there, they're all going to have views of 10,000, 100,000, well into the millions. It turns out the mean is about 3,000; the median is 39. There is a very small number of hugely popular videos. There is an enormous number of not particularly interesting, popular-for-someone, and so forth, videos.

So here's how we're able to do this. We are now able to tell you the median YouTube video because we came up with a method that allowed us to select 10,000 random YouTube videos. The way you get a truly random YouTube video is the same way that you might drunk dial people on your phone if you were really drunk. If you dial completely random 10-digit numbers, some percent of them will answer the phone. And so in this case, rather than a space of 100 million phone numbers, there are two to the 64 possible YouTube videos. That's a number in the quintillions. If you dial at random, you will get a video roughly every 130 billion times. And we know this because we actually found a way to essentially dial about 100,000 numbers at the same time. So we found a method to say, "Hey, YouTube, here's roughly 100,000 places where we think there could be a video. Let us know if there is one there." And very slowly, very patiently, over about four months, we ended up trying many tens of billions of addresses, and out of that we got 10,000 legitimate videos.

And so YouTube didn't get upset at that?

YouTube never noticed.

Really?
Well, think about it. YouTube has hundreds of billions of legitimate accesses every single day. At the levels at which we were doing it, we were doing it very slowly. We were doing it over a long period of time. We were being very, very careful to back off the server if need be. We were able to do this over a very, very long period of time.

The main thing we did with this was then document a much more efficient method for getting random YouTube videos. There was a method that someone tried out in 2011. It actually has to do with a bug in how YouTube's addressing works. You can get YouTube to autocomplete a certain type of URL: if it has exactly five characters and one of those characters is a dash, you can get YouTube to autocomplete the rest of the URL. We retrieved those videos and compared them to the purely random ones. They're statistically identical. Therefore, please don't do our stupid drunk dialing method. Do the dash method. We have code for it, and so on and so forth. What this now means is that we can generate samples without billions of requests, with only tens of thousands of requests. We should be able to generate a hundred thousand random videos from YouTube a month. What can you do with that? A couple of things.
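The "drunk dialing" idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: it makes no network requests, the toy ID space of one million slots stands in for the 2**64 space mentioned above, and the function and variable names are hypothetical rather than taken from the lab's code. It shows two things: probing uniformly random IDs yields a uniformly random sample of whatever exists, and the hit rate scaled up to the whole space gives an estimate of how many items exist.

```python
import random


def estimate_population(is_valid, space_size, tries, rng):
    """Probe `tries` uniformly random IDs and scale the hit rate to the whole space.

    Returns (hits, estimated number of valid IDs in the space).
    """
    hits = sum(1 for _ in range(tries) if is_valid(rng.randrange(space_size)))
    return hits, hits / tries * space_size


rng = random.Random(0)       # fixed seed so the sketch is repeatable
SPACE_SIZE = 1_000_000       # toy stand-in for the real ID space
TRUE_COUNT = 50_000          # hidden "videos"; the estimator never sees this number
existing = set(rng.sample(range(SPACE_SIZE), TRUE_COUNT))

hits, estimate = estimate_population(existing.__contains__, SPACE_SIZE, 200_000, rng)
print(f"{hits} hits, estimated population about {estimate:,.0f}")
```

In the real setting the hit rate is many orders of magnitude lower, which is why the purely random approach needed months of patient probing and why a cheaper method that produces a statistically equivalent sample matters so much.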
First of all, you can calculate how big YouTube is. So YouTube, as of about a week ago, is about 13.3 billion videos. You can see how fast it's growing. YouTube continues to grow explosively. There was a bit of a bump up during the pandemic, but in general it's along an exponential curve. With that sample you can do some fun stuff, like throw it against OpenAI's Whisper and figure out what languages YouTube videos are in. It turns out that we can identify about 20 percent as being in English and about 6 percent as being in Hindi. Spanish, Portuguese, and Russian are also very highly represented.

But here's another thing you can do once you've got this model. You can look at an arbitrary YouTube video and try to figure out how it fits in the overall popularity of YouTube. So you can look at my benighted TED talk from 13 years ago, which did not do particularly well, and you can find out that it ranks in the 98th percentile on YouTube, which on the one hand makes my mother feel really good, but feels pretty bad for me as a social media influencer. But you could also do things like find out, for this piece of misinformation, how popular is it vis-à-vis other posts in Arabic, or other posts in the last year on YouTube?

So let's be clear: totally unpermissioned research, right?
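Placing one video against a random sample amounts to computing a percentile rank. The following is a minimal sketch, not the released model: the function name is hypothetical and the eleven view counts are invented, chosen only so that the toy sample echoes the heavy tail described earlier (a median of 39 and a mean in the low thousands).

```python
from bisect import bisect_right
from statistics import mean, median


def percentile_rank(sample_views, views):
    """Percent of sampled videos whose view count is at or below `views`."""
    ordered = sorted(sample_views)
    return 100.0 * bisect_right(ordered, views) / len(ordered)


# Invented heavy-tailed toy sample: one huge video drags the mean far above the median.
toy_sample = [0, 2, 5, 11, 20, 39, 75, 150, 400, 1_200, 30_000]

print("median views:", median(toy_sample))
print("mean views:", round(mean(toy_sample)))
print("a video with 1,000 views ranks at about",
      round(percentile_rank(toy_sample, 1_000)), "percent")
```

The same function works on any subset of the sample, for example only videos in one language or from one year, which is what a comparison against "other posts in Arabic" would require.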
We actually use YouTube's API, but we use it to pull down the metadata from a sample that we are getting outside of YouTube's API. We are actually using an undocumented YouTube API, colloquially referred to as InnerTube. If you look up InnerTube and YouTube online, you will find a whole set of libraries developed to take advantage of this internal API. This internal API is intended for YouTube developers who are trying to build new tools to interface with it. It's well known in the research community. YouTube has not yet shut it down. We hope that YouTube will not shut this down, but they may.

Here's where we get into what I would say are sort of the counterarguments for all of this. There are ways to do this research irresponsibly. We could take this list of videos that we found, and that we're generating every month, and say, "Ha ha, here's the funniest lame YouTube video we found. Here's this video that only got three views and was posted in 2012. Look at how terrible it is." That would not be a good idea, right? That person is entitled to security through obscurity. And I would argue that the vast majority of those videos probably are entitled to that sort of security. If you've got fewer than, let's say, a thousand views, 10,000 views, you probably do not expect the general public to be going and finding those videos. That's why we're releasing this in terms of models.
We're basically saying, if you found a video, you should be able to hold it up against our graphs and figure out how your video plays into it. But only under very certain circumstances would we release that whole data set, and you would need to give us very strong assurances about personally identifiable information.

So just to be clear, by "security" you're kind of referring to a form of amplification that is outside of the built-in amplification algorithms that YouTube currently has?

A lot of people refer to this as the Justine Sacco problem. So this is the classic story of the young woman, white, South African, politically very liberal, from a family associated with the left in South Africa. She had about 130 Twitter followers, most of whom knew who she was and what her politics were. On her way home to Cape Town, she tweeted out, "I'm going to South Africa. Hope I don't get AIDS. Ha ha, I'm white." Not a great tweet, but a very different tweet if you know who the person is and she's talking to 130 people who know her. Instead it ends up being amplified by Gawker, ends up costing her her job, most of her life; it really ends up being a huge explosion. For most of us who research social media, that idea of undue amplification of something that you assumed would be private, that's one of many risks associated with this.

So there are at least two risks associated with this, right? If we did this badly, we could put an undue load on YouTube, cost them money, and make it hard to maintain their service. I feel pretty confident that we haven't done that. YouTube seems to run okay, no one has come and objected, and we're making no attempt to hide our tools for doing this. In fact they're signed. They say, contact us.
It's such-and-such if you have a problem with this. Second, we could do harm with privacy to some of those users who don't expect to be released, which is why we're releasing aggregate data and we're being careful on the release. But the hope is that there are also some benefits from this: that we can look at this and say, here are some things that you probably didn't know about YouTube. From our hand coding of that data, about 20 percent of videos on YouTube are gameplay. They're basically people playing a video game and streaming it in real time. There is surprisingly little political or news content on YouTube. It's about three percent, maybe it's closer to two percent. About three percent is religious content. Those are people live streaming church services or sermons or something along those lines.

Can you talk a little bit, by the way, about possibilities for methodological skew here, though? Because it sounds like basically the way that you collect these videos is you do sort of a random scan through the namespace, and then when you get a hit, when you actually find a name that corresponds to a real video, then you can record various statistics about it. So does that have a recency bias, for example?
It shouldn't, because the namespace has been consistent from the start of YouTube in 2006. They've been using the same type of identifier. I think if there were a radical change, you would see a real change in the video uploads per year, where actually what you see is a very smooth curve over time. We are pretty sure at this point, and we've got a hundred-page paper sort of documenting it, that this is an accurate, correctly stratified sample of YouTube, which we don't believe anyone's done in more than a decade. We then did the work of hand coding a thousand other videos out of this, trying to get a sense for what they are. We also did the work of doing automatic language detection on some of this. And we feel like we probably have an accurate picture, so long as by accurate you mean a slice of the entirety of YouTube.

Now, there's a very reasonable objection to this, which is: why do you think that's the right way to study YouTube? Certainly the people who have videos that are getting seen by a hundred million people, a billion people, they're going to have more influence. Absolutely true. Absolutely legitimate way to study YouTube. This is critical context.

So the example that I like to give on this is that TED talk of mine that ranks at the 98th percentile. That's a great example of people trying to do the let's-go-viral-and-succeed model, right? I wanted my TED talk to be enormously popular. It wasn't. I hoped to be Mr. Beast. I make a lousy version of Mr. Beast, and so I end up in the 98th percentile. The Amherst town school meeting: the recent video of that has 63 views on YouTube. They don't want that video to go viral. Something's gone really wrong if that video has gone viral. They're using YouTube for something very different. They're using it for transparency. They're using it to increase participation. They're using it for archives.
They're using it for a form of openness. We have this real tendency, when we study something like YouTube, to only care about the very high-attention stuff. Mike Sugarman in my lab has an analogy that I really like. Imagine we were studying public parks, and the signal we used to figure out what was important in public parks was volume. We would basically study buskers and the mentally ill, and we would probably miss all sorts of legitimate uses of public parks because we literally wouldn't hear them. And my sense is that there's an enormous amount of the social internet that we can't study and don't study, because our tools are very, very good at picking up those high-attention uses and not the lower-attention uses.

That makes perfect sense. And by the way, that's sort of an implicit slander against Taylor Swift, who we all know sort of dominates all the social. We're not against you, Taylor Swift. We know that your fans are very vengeful. So maybe this is a good time to take some questions from the audience or online, if anyone has any. I see a question back there. Could we maybe get a mic back there? Thank you.

Hi, thanks for a great talk. My name is Lisa Austin. I'm from Toronto. I'm here as a visiting scholar this year, and I'm a law professor, so I'm going to ask you a legal question, which is: what difference do you think the EU's DSA, which gives researchers access, will make? Will it put pressure on other countries to pass laws like PATA? Or, just because they make research access available there, will it be easier to make the case to have it voluntarily done in other places? And, you know, that's just a kind of small piece of the transparency that we need. There's lots of other things that have to happen, but I'm just curious how you think about that development.

Sure. So let's start with unpacking some terms for all involved.
So, the EU has recently passed a linked pair of very big, very important bills, the Digital Services Act and the Digital Markets Act. Part of what those bills try to do is require some transparency from platforms that they're referring to as the VLOPs, the very large online platforms. The hope is that this might provide some support for a piece of legislation that I strongly support in the U.S. called PATA, the Platform Accountability and Transparency Act, which looks for three different ways to try to increase transparency for U.S. media platforms.

The trick now becomes the implementation details of these things. So we actually tried to sign up for YouTube's research interface through the DSA. First of all, we're not eligible, because we're not an EU entity, although we can certainly partner with EU entities. We then have to write a detailed application about why our research is in the public interest. All right, fair enough. We can probably make a case for that. But after going through this detailed application process, if all goes well, we will be granted 5,000 API requests per day. Now, 5,000 API requests per day might be okay if you're researching a small phenomenon, like people using ivermectin, but it doesn't let you do the sort of proportionality research that we're doing, where you're getting a large sample or doing a model from anything like that. So what you're getting is very predictable. It's de minimis compliance. Europe is doing what Europe does: Europe is trying to be out in front, regulating first, and the platforms are going to react by saying, well, we will do the bare minimum that we can to comply with this.

Here's part of why I'm so enthusiastic about PATA. PATA is actually taking three approaches. It's taking a white room approach, which is essentially the approach that James and colleagues were able to study Facebook with. Here's a way of having controlled access to data.
We give you queries. We never get the data, but we get the results coming out of it. We can design experiments with you. There's a white room component to PATA. There's a real-time analysis component to PATA, which is basically CrowdTangle plus plus: what's the most popular stuff that's on platforms in real time? That's hugely influential. The part of PATA that I've been working on and advocating for is the unpermissioned part of PATA, and it basically says: here is safe harbor for activists, journalists, and academic researchers to do certain types of research that is unpermissioned on the platforms.

Here's why that is so important. There are platforms that are never going to give permission to do some of the research that we need to do. A great example in the U.S. would be gab.ai, right? This is explicitly a white Christian nationalist platform, and they're going to deny people the ability to come in and research violent Christian nationalism on the platform, despite the fact that we really need to do it. Unfortunately, there are platforms that are more mainstream than Gab that are no longer particularly helpful or compliant. So, for instance, Twitter under Elon Musk is being incredibly compliant with censorship requests from the Modi government. Ifat Gazia, a doctoral student in my lab, is in the audience.
She's doing work on Hindutva harassment of Kashmiri activists on Twitter, which is a real pattern. But it's extremely unlikely that Twitter, really concerned about staying viable in the Indian market, is going to make that research possible at the same time that they're closing down transparency across the board. So it's great to have specific laws on the books. Platforms will react to them by giving us as little as they possibly can. We need much broader protections for research in the public interest. And this is a moment where we actually have to trust the courts, if we could get the courts to say: look, we can make a distinction between "Oh, you just sucked down all of Reddit so that you can train your own chatbot" versus "You are analyzing piles of Reddit data because you found a connection between low-wage work and mental health issues." Those are different projects, and they should be treated differently.

We have a question.

Hi, Ethan. I just want to say thank you for doing this, and thank you for being the ethical researcher that you are. As an early career researcher, and being somewhat connected to some of your projects, I've just been in awe. But on that note, I would love to ask for an update about Media Cloud in connection to these new tools, in part because I'm also curious about what you think about scraping the open web per se. Is that still worth it at this point? And if so, how are you going about it? If not, why not?

Yeah, thank you so much. Let me just say, this question of what it means to be an ethical researcher is hard, and anyone who works in the space spends a lot of time talking to our colleagues. When I started work on this YouTube work, I called up Casey Fiesler over at CU Boulder, who is one of the best thinkers out there about involuntary data release.
And I wanted to think with her about what the harms might be associated with this. And in general, what I would say is, even those of us who are at a fairly senior stage of our careers benefit enormously from asking those questions about harm and then trying to figure out how to mitigate them.

Media Cloud is a project incubated here at the Berkman Center, supported for a very long time here, based on this idea that if something is news, which is to say someone is trying to report what is going on in the world and putting it out there on the open web, we should collect it and be able to study it through tools that are more open than something like LexisNexis. So for 17-plus years now we've been collecting data. At mediacloud.org we have quite powerful tools to let you go back about 12 years and look at that collection comprehensively, and that tool lets you do things that are really hard to do otherwise. So, for instance, if you go into a collection of U.S. newspapers from the year 2020 to 2023, you can find out that 43 percent of all newspaper stories in March 2020 mentioned COVID. That's down to under three percent at the moment. So that's a big cultural shift in terms of what we're paying attention to, where we're going. I think it's really helpful to be able to ask those questions about what's being talked about and how it's happening.

We have never had legal problems with Media Cloud, except from Harvard University. Harvard has been the one entity that raised questions of contributory copyright infringement and what sort of impossible harm we might do to this lovely institution if we continued the project. I had pretty much the dream team of defense counsel to talk with Harvard's general counsel. I was represented by Professors Zittrain, Benkler, and Palfrey, who instructed me to shut up and let them talk.
Professor Benkler in particular makes a case that it is possible to make an argument for what Media Cloud does solely by citing actions Google has taken over the years in constructing its search engine, and I think that's probably correct. But interestingly enough, other than that conversation, which had something to do with my institutional affiliation, there have been no major legal pushbacks on Media Cloud.

The major threat to Media Cloud is that it's really expensive to run. And these days it's really expensive to run because we made an ill-considered decision a couple of years ago to move a lot of our back end onto Amazon Web Services. Those costs have gone through the roof. It now costs us roughly 30,000 dollars a month to run the project. We are looking for better ways to do this. If anyone's got some good contacts at Azure, we'd love to talk with them. We tried moving it to the Internet Archive. That didn't work. I have lots of insights on why that might be. We're continuing to try to provide tools like this, both because they're useful to researchers and because I think documenting that ability to study something big, like all of news, is incredibly important. And I think it's a necessary corrective for really worthwhile work that looks at what people are saying about X, by giving that broader context.

Very interesting thoughts from both of you. I appreciate it. I'm 75. I'm from India, and more than 80 or 100 years ago there was a philosopher in India who said that if there is one person starving without food, then the whole planet should be destroyed. That was his philosophy, very idealistic, but it is impractical. So likewise, you mentioned the Kashmir example. Now, yes, the Kashmiri people may feel that they are oppressed. But from what I know, I am here, but still I feel the rest of India, the Hindus, they are not feeling any sympathy for those people.
I don't know whether it is the political philosophy or religious philosophy or the media manipulation. In the same way, the extremists, they are everywhere, in every part, not only white extremists. There are other extremists too. That's my comment. And people are selectively commenting on what they do not like. When there was killing in Israel, the Hindu party immediately condemned it, but the Congress party in India didn't condemn it. They were silent. So I'm asking a question: do you know any example wherein the situation is manipulated in the media and it has been successful in changing the opinion of the people?

So there are enormous amounts of media manipulation occurring in many places around the world right now. I think some of the most important that's happening is the silencing of vulnerable voices. And so I think I would object to your characterization of India at the moment, respectfully, because it's very, very difficult for many of those dissenting voices to safely speak out. There have been in Kashmir, in the last decade, long, sustained internet blackouts designed to silence people from speaking. There have been arrests of prominent figures like Khurram Parvez, who there is no evidence are involved with terror, but who simply seem to be advocating for a Kashmiri point of view.

What gets very challenging is when political power intersects with technological power. So there are at least three things happening simultaneously around Kashmir. There's traditional state power, which can arrest people, harass people into silence, threaten to pull people's passports, threaten to bulldoze people's homes, all of which can be very effective in silencing people. There is Indian pressure on platforms like Twitter, which no longer provide any resistance at all to takedown requests for Kashmiri content, pro-Muslim content, so on and so forth. That channel gets closed. There are also aspects of peer censorship, which is to say people who are sympathetic to the
government who are making life so difficult online for, for instance, visible Kashmiri Muslims that it is essentially impossible for them to have a voice.

I am choosing Kashmir as an example in part because it is something that I am actively studying with my lab. I do not mean to pull up Kashmir as a unique example. We could make a very similar case about what's happening in Palestine, about what's happening in East Turkistan, so on and so forth. It is really important to understand that the public sphere these days is digital as well as physical. And so understanding that control of the public sphere involves decisions made by platforms, often about places they do not understand very well, this is an incredible problem.

I had the privilege earlier this year of going to a meeting in Rio between Brazilian researchers and Filipino researchers, and they were trying to figure out what happened in their respective 2022 presidential elections. Brazil managed to avoid a military coup, very narrowly actually. The Philippines instead elected the son of a dictator whose main platform was rehabilitating his parents. How did that end up so different? The one thing that they could agree on was that Twitter, Facebook, YouTube
did not understand their countries, did not care about their countries, and were not following their countries' laws, and that is an enormous problem. It gets even more problematic when you have a state like India, which may be using those laws to persecute a minority and then using state rules to restrict the use of that platform.

And this is one reason, by the way, why these transparency APIs, or at least API access to allow that external research, is so important. I mean, there's been a lot of great research, for example, in the specific case of India, and in the Washington Post, for example, about the pressure that's being put on Twitter and these other platforms. And it's not just about India, as you were saying. It's happening all over the place. And so as these spaces become sort of these privately run but public spaces, as some of our colleagues here have talked about before, like Juan Chow, it's very important to allow that type of access. Guzzo, did you have some questions online?

Yeah. We are unfortunately reaching the end of our talk, but there are still two more questions to go, one from me summarizing a few of the listeners and viewers online coming in with questions. The framework that I would put on this question is: these systems that we're working with are extremely complex, and the research is responding to the issues and damage that this technology may cause. The central issue here is data access. The question from Jonathan is: how do you think academia and industry could reach a durable understanding of what level of data sharing is sufficient while accounting for privacy, business, and technical constraints? Law also adds to the pile of questions by asking: is most of the internet benign, and can we use these tools if we're creative and semi-willing to ignore terms of service? And Joseph is also suggesting new innovative and creative approaches, and really wanting to hear from
your lab at UMass, Ethan, if you have any suggestions, such as a common ground, etc. How is it going so far?

So let me start with the benign question, because I think it's actually really easy to answer. I think the internet for the most part absolutely is benign. And I think in many ways we have trouble seeing how useful and how benign the internet is because we're very good at amplifying bad news. I think everybody here has seen headlines about the internet causing harm to young people, lots of fun stories about technology leaders not allowing their children on the internet, Instagram is making us miserable and sick, and so on and so forth. What's really interesting is when you start looking at large-N studies, which is to say tens of thousands, hundreds of thousands of people, or even long-term panel studies like the traumasos study, which has worked with a panel of children from early childhood up into adulthood. And what you find out is that the effects of the internet are a bell curve. There are some extremes where, for some people, the internet's really bad, but it's a fairly small group of people. There's a small group of people for whom the internet is really, really good. It may be the thing that saves them from suicide, or allows them as a queer kid in a rural area to thrive, or helps them find people they really need. And for a lot of people in the middle, the internet is just kind of part of life, and it's not strongly associated positively or negatively. That's not a fun headline to write. It's a much more fun headline to write about the extremes.

This doesn't mean we should not try to correct the harms at the extremes. We should. But one of the reasons why our lab is starting to work on what we call the quotidian internet, the everyday internet, is that so much of the everyday internet is fairly innocent, harmless, beautiful, and lovely in its own ways, right? Yes, you have Mr.
Beast buying 10 million red Legos so that you can have the most red Legos ever and getting billions of views. But there's also someone writing a love note to their spouse and posting it on YouTube because it's the best way to share a video that was really only meant to be between the two of them, or someone streaming a service from a church for the person who can't get out of their house and get in there. And those counterbalance, to one extent or another, many of the awful things going on.

The question from Jonathan is really interesting, because I'm going to disagree with the framing. I don't know that the answer is finding a... I actually think this is a rights question. I would make the argument, or I do make the argument: I teach a class in the fall called Defending Democracy. It's basically a class for undergrads on the public sphere. And I make the case that the U.S., somewhat unusually, is a nation built around the public sphere. It's a nation that basically says, if you're going to have a democracy, you're going to have a public sphere. We have a Constitution that refers to newspapers and the post office, with basically these [inaudible] an interactive public sphere of the press [inaudible]. And our ability to maintain and monitor and understand that public sphere is deeply connected to our successes in democracy.

So I actually think a rights framework is the way to go after this. I think we have a right to study these tools in some of the ways that I am studying them. I am happy to have those conversations with platforms about how to study them in less confrontational fashions. What I am not ready to do is concede that right. I am not willing to say my ability to study YouTube is dependent on YouTube's willingness to be studied by me. And so that's the distinction that I would try to offer there. Do I think we could get better about this?
Absolutely, but it doesn't look like what it looks like with the DSA. The whole "we're going to pass a whole package of legislation and you're going to respond with 5,000 requests after jumping through hoops"? No, that's not a meaningful negotiation. And, you know, the DSA and DMA are heading in the right direction by saying, look, we have a right as the EU to understand what these platforms are and what they're doing to us. Now we have to find a way to defend that right and implement it.

I think it's a really great way to look at it. And a rant that some of you have heard me say already is that sometimes the companies will say, oh, maybe you have a right to it, but how are we going to do this? Oh, poor us, you know, we don't know how to scale up, or things like this. And as I've ranted before, they never say this about ads. They never say, there are just too many ads that we could give to people, how could we ever figure out what to do? And so I think that you're right that if we come at the conversation from the perspective of: we as outside researchers want to work with you, but there's a certain right to access, a right to study these platforms, and have that be the starting point, I think that's a great way to start that.

Well, that is an omen that now we've run out of time. So thank you so much. I think that Ethan will be here for a couple of minutes afterwards. I'm going to be hanging out if anyone wants to talk.

Thank you so much. It's so good to be here, and it's wonderful to talk with people and take these questions. Thank you for coming here. James, thanks for having me.

No problem.
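
A note on the random scan of the video namespace discussed earlier in the conversation: the idea can be illustrated with a small simulation. This is only a toy sketch under stated assumptions, not the lab's actual tooling. The size of the ID space, the number of valid IDs, and the function name `estimate_population` are all made up for illustration, and no YouTube API is called. It shows how a hit rate from uniform random probes can be scaled up to an estimate of how many valid IDs exist.

```python
import random


def estimate_population(id_space_size, probes, is_valid, rng):
    """Probe IDs uniformly at random and scale the hit rate up to the whole space."""
    hits = []
    for _ in range(probes):
        candidate = rng.randrange(id_space_size)  # uniform guess in the ID space
        if is_valid(candidate):  # stand-in for "does a video exist at this ID?"
            hits.append(candidate)
    hit_rate = len(hits) / probes
    return hit_rate * id_space_size, hits


# Hypothetical toy numbers, chosen only so the example runs quickly.
ID_SPACE = 1_000_000
TRUE_COUNT = 20_000

rng = random.Random(0)
# Pretend these are the IDs that correspond to real videos.
valid_ids = set(rng.sample(range(ID_SPACE), TRUE_COUNT))

estimate, hits = estimate_population(ID_SPACE, 50_000, valid_ids.__contains__, rng)
print(f"hits: {len(hits)}, estimated valid IDs: {estimate:.0f}, true: {TRUE_COUNT}")
```

In this toy setup every ID is equally likely to be probed, so an old valid ID and a new valid ID have the same chance of being hit. The hits also form a uniform random sample of the valid IDs, which is what makes it reasonable to study proportions within them.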