So I thought what might be interesting to tell you guys about is not a top-level abstract view of what science journalism is or why it matters, but to give you a very granular, from-the-trenches, specific point of view. So I'm just going to tell you what I did last week, blow by blow. It'll hopefully clear up some illusions about what science journalism is, why bother having it, and why put up with us.

But first I do want to give you one quick view of how a journalist sees data. So this is data from the Google Books project, and it was created by a group at Harvard who essentially scraped n-grams from the Google Books corpus. Google Books is a digitized sample of about 5% of every book ever published, and n-grams are simply clusters of words. So "apple pie" is a 2-gram, "my apple pie" is a 3-gram, and so forth. And you can use this interface to look at the distribution of n-grams over time in all published books, using this sample.

So here's one that I really like. This is "God" and "data" over time; God is the blue line up top, and the y-axis is the percentage frequency of the n-gram in all books. So if you're approaching 1%, which it is, or even a tenth of a percent, that's a lot of instances of this n-gram. Right through the 19th century, God was doing well. Right around Darwin and the Civil War, in the latter half of the industrial revolution, God started to slump and upstart data started to rise. The inflection point, interestingly, is right around 1973. I'm not sure what happened; I think it was a silent event, but data overtook God. And what I find most interesting is that God's making a comeback. This data actually ends in the year 2000, and I wouldn't be surprised if God has overtaken data once again.

So this is just data. If I hadn't shown you this graph and spun it into a story, it's just another couple of rows in a massive, never-before-seen database. So this is what I do. This is when I'm happiest: when I find something and I get to turn it into an actual story.

So last week... yeah, let's keep questions to a minimum because I'm going to try and rumble through in half an hour, but go ahead. [Audience question.] Web data is non-random. It's far from random. There is no such thing as a random data set in the real world. But let's get into the specifics at the end.

So last week, and this is an example of a bread-and-butter situation for a science journalist, a paper was going to be published. In this case it was this paper, which is, I think, in Science... ah, Science Advances. That's Science's new open-access journal, so it's like PLOS ONE, but published by the same organization that publishes Science. A little-known fact is that all science journalists know what the news is going to be a week in advance. Everything that's going to be published, to a first approximation, gets turned into a press release, which is then sent under press embargo to essentially every science journalist in the world. There are so many of them that aggregation systems have sprung up to help us make sense of the fire hose and find things in it. It's a weird system that doesn't exist anywhere else in news. Basically, no one else uses the embargo system. I mean, imagine any other area of life where the media agrees with the powers that be: we won't write about X until you say it's okay to do so. Somehow we inherited this system. It's been going for much longer than I've been a science journalist, and there it is.
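Just to make the n-gram idea concrete, here's a minimal sketch of pulling n-grams out of raw text. This is my illustration, not the Harvard group's pipeline; the function name and example sentence are made up.

```python
# Minimal n-gram extraction: an n-gram is just a sliding window of n words.
def ngrams(text, n):
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "my apple pie is the best apple pie"
print(ngrams(sentence, 2))  # 2-grams: ('my', 'apple'), ('apple', 'pie'), ...
print(ngrams(sentence, 3))  # 3-grams: ('my', 'apple', 'pie'), ...
```

The Ngram Viewer then just counts how often each n-gram appears in books from a given year and divides by the total number of n-grams for that year, which is the percentage frequency on the y-axis.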
So by that quirk, I know what next week's news is going to be. In fact, I could tell you right now what the headlines are going to be in The New York Times with regard to the latest breakthroughs in Science, Nature, PNAS. It's a strange situation, and it's an interesting data set. I've scraped it; you can see interesting trends in it. So this paper was going to come out this week, and I knew it a week in advance.

And why is it a story? Okay, so the headline is that women are getting much less for the exact same product on eBay. But the gender signal on eBay is extremely subtle, and this is the first question I had to ask. So I read the press release. The claim was big and bold: basically, there's a gender gap on eBay. The first thing I was suspicious of is, well, how much of a gender signal even is there? If you buy something on eBay, which I've done, I am honestly never even conscious of the gender of the seller. You never see them. There are no profile pictures. So the first thing I had to check, by reading the actual paper, was how they sorted out that problem. The answer was: very nicely. They hired a huge number of people on Mechanical Turk to go to eBay sales and try to guess the seller's gender, with three choices: male, female, or not enough information. And when the Turkers did guess, they did much better than 50% accuracy; they only got it wrong a very small amount of the time, around 5%. So boom, there you go: there actually is a gender signal. The proposal from the authors was that it comes through partially from what else you're selling, because that's some of the information you get about a seller, and also just from the user name. If you use a recognizably female first name in your eBay seller profile, that's a gender signal coming through.

Okay, so now I was convinced they might have something here. It's a good experimental setup. And what did they actually find? Well, the next question I had was: okay, you found this big gap, but what if the thing mediating this effect is the text, the terms of the sale? Maybe women are getting less money because they're less self-flattering. That was a good hypothesis. Maybe they're just more realistic about their products. Men are like, this is the best thing ever, and women are like, well, it's used. Maybe that's it. So I had to read the paper, and indeed they did the hard work. They actually ran sentiment analysis on all the text of these listings to see if there was a difference in the distributions. There was a tiny difference, but when you controlled for it, it didn't explain any of the gender gap. Okay, nice.

But the most compelling experiment, which is nowhere in the press release, you really have to dig for it, and one of the things that really clinches it, is gift cards. On eBay, you can buy a gift card for 50 bucks at a certain store, online or real-world. It's the exact same product, the exact same value; you know exactly what you're getting. If there's a difference between men and women in the final amount of money they're getting for this thing, then that really is very likely to be some kind of real gender gap. And indeed, there it was: the exact same effect, just as strong (there's a toy sketch of this comparison below). It was about 79 cents on the dollar, which, by the way, is the gender gap often cited for salaries. Coincidence?

So I had a pretty good story here. And in spite of it being kind of about a controversy, this is not an adversarial story. There's no bad guy here.
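To show the logic of that gift-card comparison, here's the promised toy sketch. The column names and prices are invented, not the paper's data; the point is only that with an identical product you can compare average final prices directly.

```python
# Toy gift-card comparison: identical product, identical face value, so a
# systematic price difference by seller gender is hard to explain away.
# Data and column names are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "product_id":    ["gc50"] * 6,            # the same $50 gift card
    "seller_gender": ["f", "m", "f", "m", "f", "m"],
    "final_price":   [39.0, 49.5, 40.5, 50.0, 41.0, 48.5],
})

by_gender = sales.groupby("seller_gender")["final_price"].mean()
print(f"women earn {by_gender['f'] / by_gender['m']:.2f} on the dollar")
```

In the real study you'd also control for listing text, seller reputation, and so on, which is exactly what the sentiment-analysis control above was doing.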
I guess the bad guy is all us men in some way, or society. But there's no one being directly criticized by covering this. So this is a typical science journalism situation, what they call the cheerleading-for-science situation. You basically just tell it like it is: here's the story, I checked it out, here's the research, strengths and weaknesses. So that was the first few days of last week.

Meanwhile, I had another story to juggle, which was a little weirder. There are these crowdsourced games now that scientists are using to get data. The granddaddy of this approach, I guess, is SETI@home, the screensaver, which is largely a Berkeley-originated project. In fact, I think they open-sourced the entire platform; it's called BOINC, and the B in BOINC is Berkeley. The idea is: you've got a huge computing problem which is amenable to being chunked and distributed, as you would across GPUs, for example, and you don't care so much about the transit time between a chunk getting done and getting back to where you actually put it all together. There are a lot of such problems. So why not distribute it over the internet to thousands, even millions of computers that are just sitting there idle, owned by altruistic people willing to do a little bit of number crunching for you? (There's a little sketch of this pattern below.)

The next generation of this is: well, there are also all these idle brains. Sure, there are idle computers, but why don't we use the idle brains, the people just twiddling their thumbs watching TV all the time? Why don't we harness their brains by chunking problems that are actually doable by brains? The problem is, it has to be fun, or you have to pay people. Those are basically your two options. So what the most successful of these projects do is gamify it. Not every problem is amenable to gamifying, for sure, but a few notable successes have really gone gangbusters. The first ones were classification tasks, like identifying galaxies in clusters of stars, that kind of thing. The ones that have really taken off are a lot more sophisticated.

The first was called Foldit, and it's trying to ultimately solve the enigma of protein folding: how do you go from a string of amino acids, just through physics, to the final three-dimensional structure of a protein that actually does chemistry? We really haven't sorted that out yet. And this game sought to do better than computers at predicting that folding. They later turned it around and said to their now-expert game folders: here's a target structure; can you find a sequence that will fold into it? So now we're getting into protein design, and suddenly you've got an army of expert molecular designers at your disposal, as long as you can keep it fun.

So Foldit, out of the University of Washington, was a huge hit. And the thing that really tipped it was an article in The New York Times. The way they explained it to me, they were doing well, but they were still solving the problem of getting enough users. Not so much to have a raw number of users, but to reach a kind of cultural tipping point with their gamers. Because for every 100,000 gamers, you're really only going to have 100 who are the cream of the crop, who do 99% of the actual breakthrough work. And you can't have one without the other.
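Coming back to the chunk-and-distribute idea for a second, here's the promised pattern in miniature, with local worker processes standing in for volunteers' machines. This is an analogy sketch, not the BOINC API.

```python
# Chunk an embarrassingly parallel job, compute each chunk independently,
# and reassemble at the end. Order and timing of completion don't matter.
from multiprocessing import Pool

def work_unit(chunk):
    # Stand-in for a real work unit, e.g. scoring one slice of a search space.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool() as pool:
        partials = pool.map(work_unit, chunks)  # farmed out to idle workers
    print(sum(partials))                        # put it all back together
```

BOINC's whole trick is that the workers are strangers' machines on the internet, so the chunks have to be self-contained and latency-tolerant, exactly as described above.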
So you can't generate a stable population of 100,000 volunteers without the romance of the output: that they're actually generating peer-reviewed papers, making breakthroughs, and ultimately patents. And that's where it's going to get really interesting: who gets the money? But you also can't have those experts without having that base pool first. It's just like sports, you know? A country that wins a lot of Olympic gold probably has a big population, with some exceptions.

So these guys made an RNA-folding version. It's actually a couple of post-docs from the Foldit project, who went off, got jobs at Stanford and CMU, and made their own game for a different problem, RNA folding. It's been doing very well, and they've been publishing papers. And in those papers, they cite the game, called Eterna, and they credit the Eterna game community.

The new thing is that last week a paper came out which was written by Eterna players. The whole thing was conceived by the players. It was their idea, and it was entirely their analysis of the data. The card-carrying scientists kind of came along after it had started, took part, made sure it was all put together well, and probably were the ones who submitted the paper. But here's the catch: the original paper that was submitted to the Journal of Molecular Biology, an Elsevier journal, had gamer names as the authors. And they did that deliberately; they had a discussion about it. Should we use real-world names, or the names we all know each other by? And they decided: look, we all know each other by our game names. That is the identity under which we've done all this work. That's how we best recognize each other. That's how we've been crediting each other.

And the journal editors somehow didn't notice. They told me that the thing that tripped it up is that part of the process of publishing a paper is registering it with PubMed and doing some background paperwork. And of course that broke, because some of the names had numbers in them. So suddenly the journal editor was forced to make a hard call: are we really going to allow this? And he kind of freaked out, contacted the ethics board of Elsevier, and they put a stop to the paper. They actually froze it. They had already sent out the embargoed press release; journalists like me everywhere were already working on our cute little stories. And suddenly I get word that the paper isn't going to be published and it would be inappropriate to write about it. Nice try.

So, you know, I don't know if you've ever heard of the Streisand effect. It's named after a famous case in which Barbra Streisand tried to legally force a photographer not to publish a picture of her California home. And the result of the lawsuit was that everyone saw that home. That's what happens when you tell journalists not to write about something. So suddenly the story is really about this issue. And it turned out to be a pretty productive thing. No one came out as a bad guy in the end. And it forced what might otherwise have remained a background issue, one that would cause trouble for multiple people in the future as these citizen-science projects come up, to be essentially blared out; now everyone is thinking about it, and they're probably going to put some policies in place. So this is journalism doing a good job: amplifying a signal that everyone actually wants. So there's the actual paper.
And you can see the authors are listed like regular humans now. A lot of these people are actually gamers. In the end, to get it published in this journal, they had to reveal their full names. And they were cool about it. They were like, okay, if you insist. They weren't super upset. But people who are advocates for citizen science were upset. And this is often the case: once a story goes out into the public sphere, the loudest voices are often not even the people involved. It's often advocates for one side or the other, which has good and bad aspects.

So then, just as I was finishing that story, we had another paper coming out in Science with a pretty blockbuster result. And this one is getting a little more adversarial. I didn't learn about this from a press release; there was no press release. One of the researchers I had worked with years ago, on a data set to do with surveys in Afghanistan and Iraq trying to estimate deaths from violence and other problems, sent me an email saying there's a really interesting little meeting coming up. Its code name is Datafab. And I was like, oh, Datafab? And he said, well, basically the social science research community that makes and runs and analyzes data from surveys is getting together to address a problem that we all kind of know is a problem: data fabrication. And I said, how much data fabrication are we talking about? And he said, just watch it; it's live-streamed. And this is an example of how there's no way I would have known about this without someone live-streaming the event. So it was great. Live streams can be really powerful.

Here's the thing that really makes the story. This was the last slide of a presentation by two former Pew Research Center social scientists. One of them now works at SurveyMonkey right here in the Bay Area, and the other has dual appointments at Michigan and Princeton. Quietly, with no news coverage (I had not heard anything about this), these guys have been presenting, at meetings over the last year, a statistical test that they developed. The basis of it is this: if you ask 100 people a set of 100 questions, their pattern of answers is almost like a fingerprint. What are the chances that two people will answer all 100 questions exactly the same? Very low. But as you lower the bar for what counts as a match, duplicates get likelier, and it turns out that even in honest data you can expect something like 5 people out of those 100 to match some other respondent on 85% of their answers, just randomly. So they said: maybe this could be the basis of a test for detecting unusual levels of duplication in data sets (there's a sketch of the test below).

And these guys knew the problem first-hand, because they had found a bunch of duplication problems in their own data from surveys of the Arab world. The Arab world, not all of it, but in many places, is very dangerous to survey, and you get a problem called curbstoning. It's exactly what it sounds like: the interviewer sits on the curb and makes up the data. It's such a problem that it has a name. And they had identified lots of examples of this, and it's not the case that you just write the exact same answers for every fake respondent. Often it's semi-random.
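Here's roughly what that duplication screen looks like in code, as promised. This is my sketch of the idea from the talk, not the researchers' implementation; the 85% match and 5%-of-respondents thresholds are the ones quoted above.

```python
# For each respondent, find the highest share of identical answers with any
# other respondent, then flag the data set if too many respondents have a
# near-duplicate. Thresholds follow the talk: 85% match, 5% of respondents.
import numpy as np

rng = np.random.default_rng(0)
answers = rng.integers(1, 5, size=(100, 100))  # 100 respondents x 100 questions

def max_match_scores(answers):
    n = len(answers)
    scores = np.zeros(n)
    for i in range(n):
        match = (answers == answers[i]).mean(axis=1)  # share of identical answers
        match[i] = 0.0                                # ignore matching with self
        scores[i] = match.max()
    return scores

share_flagged = (max_match_scores(answers) > 0.85).mean()
print(f"{share_flagged:.0%} of respondents have a >85% twin:",
      "suspicious" if share_flagged > 0.05 else "looks normal")
```

On honestly, independently answered surveys, near-duplicates above that bar should be rare; a curbstoner recycling their own answers pushes the flagged share up fast.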
Often you'll take yourself as the respondent and just sort of interview yourself, answering a little bit differently each time. It's hard and tedious to make genuinely random data. And unfortunately this is not amenable to an older trick that uses something called Benford's Law. Benford's Law describes the expected distribution of leading digits in naturally occurring numbers. It has a particular shape which you can expect, and if you get data that deviates from it, it makes you look twice (there's a sketch of the Benford check further down). This kind of data fabrication is not amenable to that kind of test.

So they were pretty excited: hey, maybe we actually have a method for finding problems, at least as an early screen. So they ran it on about a thousand public data sets, and 26% of the international surveys came up with a significant amount of likely falsified data. If you look at just OECD countries, it drops to 5%. So already that is telling you something. You need to be able to explain why there's that difference, and whether there's an alternative explanation to curbstoning. So they've been quietly introducing this all the past year, and they never say which studies are on the list. They just say, we looked at about a thousand. But everyone knows where this data is coming from. And the bombshell was the last slide of their talk last week, where they said: okay, here are 309 of Pew's studies. Everything above the red line failed the test. And it was about a third, 30%.

And right after they spoke, the director of survey research at Pew got up and tore the hell out of the method, finding every possible flaw, and she was very well prepared: it doesn't account for the number of questions, the number of respondents, and so on. I was able to go back and forth with both sides. This is a position a science journalist often finds him- or herself in when a story is adversarial. I felt like a judge in a case, listening to two lawyers making their arguments. And I've got a deadline; I'm going to be coming out with a story one way or the other. So my job, as efficiently and fairly as possible, is to let each side have their say, for sure, but better than that, to ask intelligent questions that avoid misunderstandings on my part, and also to try to cool things down between them. By the end, it was pretty clear to me that Pew was being defensive, maybe over-defensive, and that the researchers aren't saying that every single dot above the red line is an example of fraud. They're saying this is an early screening test. So once you let people talk it out, things start to cool down, and that's really helpful.

The fact remains: now we have to find out how much of our survey data is fraudulent, and how much that matters. So now it's an open question, everyone's talking about it, and as a community we will sooner or later get to the bottom of it. But this was rumbling around for a year. The researchers actually submitted this as a paper, and they got a legalistic-sounding, scary letter from Pew saying: withdraw your paper, we heard that you submitted it. The same week it was submitted. So this is where journalists are useful allies to scientists. We have certain protections that allow us to just ignore legalistic threats, as long as we're on the right side of the law, and essentially make sure that the truth gets told. Pew hasn't been making much noise about the paper since then, and it got accepted. So that's an ongoing controversy.
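Before moving on: since Benford's Law came up as the older trick, here's what that check looks like. Again, just a sketch, and a big deviation is a reason to look twice, not proof of fraud.

```python
# Benford's Law: leading digit d appears with probability log10(1 + 1/d).
import math
from collections import Counter

def leading_digit(x):
    return int(str(abs(x)).lstrip("0.")[0])

def benford_statistic(values):
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat  # chi-squared statistic with 8 degrees of freedom

# Uniformly invented numbers have flat leading digits and score badly.
print(benford_statistic(list(range(100, 1000, 7))))
```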
Here's the most adversarial story of all that I've been dealing with last week. So this is a website. Raise your hand if you've heard of Sci-Hub. Interesting. So Sci-Hub is a website which, as it spells out clearly, lets you get any paper. It's basically a library, except here's how it works. It takes the DOI or title of the paper you want. It goes to a distributed database called LibGen and sees if the paper is in there. If it is, it serves you up the PDF. LibGen currently has about 50 million papers; it's essentially the largest academic library on tap in the world. If the paper is not in LibGen, Sci-Hub reaches into a grab bag of credentials, uses one, a random one I assume, to log in to a university or other site that has a license for the article, gets it, gives you the PDF, and puts a copy in LibGen (there's a sketch of that flow below).

This has been going on since 2011, and it was kind of quiet until essentially the past few weeks, when people started covering what was going on with this lady. That is Alexandra Elbakyan. She's from Kazakhstan. She's a neuroscientist. She created Sci-Hub as an undergrad-slash-master's student, for her own use. It was essentially an efficiency boost for what people were already doing, which is #icanhazpdf. So, #icanhazpdf, anyone ever used it? Why would you? Because you didn't have access to something. If you don't have access to a paper and you really need it, in the olden days what you would do is tweet: #icanhazpdf, this paper, here's my email address. And some altruistic person would email you that PDF. That's essentially how the stopgap worked. And her solution was: well, why don't we just make a machine that does that for everyone? So as soon as anyone can has PDF, everyone has PDF.

The only catch is that it's technically illegal. And she is on the hook for at least millions, if not a billion, dollars now, because the largest of the publishers, Elsevier, successfully got an injunction against her website. For her, that was a matter of just switching to a new URL; it's like whack-a-mole. And I think the court case is still up in the air, still ongoing, and she hasn't hired a defense. So it looks like kind of a foregone conclusion that she will be successfully sued in a New York court, and since Elsevier is an international company, it's going to be very hard for her to travel for the rest of her life. I don't think she can come to America. If she were in America, she would have faced a similar fate to Aaron Swartz, who killed himself after being caught downloading four million articles in a similar way, from a closet at MIT.

I found her. It was pretty easy, actually. And we used Telegram; it's a secure protocol. Here's Aaron; he's dead. And here she is a few days ago, I guess this weekend. We were trying to figure out a protocol: how do you prove, if you're chatting with someone, that they are who they say they are? And I was like, hey, let's pair the face with a normal password. I'm still not sure if this was even worth the effort, but I was like: okay, write this down, hold it up, close your left eye, and take a picture. So she did, and I did the same. And since we're both public figures, we know what each other looks like, and it was done quickly, I reckon that's good enough. So I think, with high confidence, I was actually communicating with Alexandra Elbakyan. And she sent me data, because I asked for it. And this is what journalists do: they ask for stuff.
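The Sci-Hub flow I just described reduces to a few lines of control flow, sketched here as promised. Every helper is a hypothetical stub (there is no public API like this); the point is just the cache-first, credential-fallback logic.

```python
import random

LIBGEN = {}                                  # stands in for the ~50M-paper cache
CREDENTIALS = ["uni_a", "uni_b", "uni_c"]    # grab bag of borrowed logins

def fetch_from_publisher(doi, credential):
    # Stub: in reality this logs in to a licensed site and downloads the PDF.
    return f"<pdf bytes for {doi} via {credential}>"

def get_paper(doi):
    if doi in LIBGEN:                        # 1. already cached in LibGen?
        return LIBGEN[doi]
    credential = random.choice(CREDENTIALS)  # 2. borrow a subscriber login
    pdf = fetch_from_publisher(doi, credential)
    LIBGEN[doi] = pdf                        # 3. cache it for the next person
    return pdf

print(get_paper("10.1000/example"))  # fetched through a credential
print(get_paper("10.1000/example"))  # second request served straight from the cache
```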
And sometimes people give it to you. So I now have the past six months of download events. And I bought an IP-to-geolocation database and worked with her to figure out essentially where people are downloading stuff from (sketched below). So this map is minus China and Korea and a few other countries; China is actually number one, along with India, and they're very close, so imagine China being really red. But here's what the distribution looks like. And there's some... I don't know what the story is yet. Oh yeah, I got excited.
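For what it's worth, the geolocation step is mechanically simple: join each download's IP against a table of IP ranges and count by country. The ranges and addresses below are toy values; a commercial IP-to-geolocation database works the same way, just with millions of ranges.

```python
# Map each download event's IP to a country via sorted IP ranges, then count.
import bisect
from collections import Counter

# (range_start, range_end, country), sorted by range_start; toy values.
IP_RANGES = [
    (16777216, 33554431, "CN"),
    (33554432, 50331647, "IN"),
    (50331648, 67108863, "US"),
]
STARTS = [r[0] for r in IP_RANGES]

def ip_to_int(ip):
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def country_of(ip):
    n = ip_to_int(ip)
    i = bisect.bisect_right(STARTS, n) - 1
    if i >= 0 and n <= IP_RANGES[i][1]:
        return IP_RANGES[i][2]
    return "??"

downloads = ["1.2.3.4", "2.3.4.5", "3.4.5.6", "1.9.9.9"]  # one IP per event
print(Counter(country_of(ip) for ip in downloads))        # CN: 2, IN: 1, US: 1
```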