Thank you very much, Stefan, for that kind introduction. What I'm going to talk about today is mostly the current practice of data journalism, but I also want to think about what I hope will be some of its future — which we're starting to see, but not too much so far — actually integrating data science into this practice.

Whenever I talk to people who aren't journalists, I often start with this slide, by way of lowering expectations that I might have any particular expertise here. This really is a pretty accurate cartoon representation of how reporters actually go about doing their work. But when I put this slide in, I was reminded, for this audience, of an interview I did for New Scientist magazine a few years ago with a data scientist, Jeremy Howard, who at that time was president of Kaggle, the company that runs data science competitions (I think they've changed their mission a little bit since). He made this very provocative comment about the value, or otherwise, as he saw it, of specialist domain knowledge. Not surprisingly, being journalists, we pulled that out and made it the headline. But he had a serious point: specialist knowledge in a particular subject area can channel your thinking into certain predefined areas, and what he likes to call dumb algorithms, like the random forest, applied to the data can bring new insights. So is there some common ground between data journalism and data science? Well, I hope so, as we'll go on to see.

So, a little bit more about journalism, and a little bit of navel-gazing. The image on the left comes from a special issue of the Columbia Journalism Review from sometime in 2013, where they asked a whole bunch of the great and the good in journalism to answer this question: what is journalism for? I particularly like this answer. And then I thought, well, if I'm showing you that, I should maybe come up with what I think some of the answers are. I say this at the risk of seeming terribly pious, and I'm reminded of a lecturer at the J-School, Paul Grabowicz, whom we unfortunately lost to cancer just over the Christmas period. I was watching a video of him speaking to students last night, and he said: always remember, as a journalist, don't take yourself too seriously. So, at the risk of taking myself too seriously, what do I think journalism is for? Well, we live in a very complex world, and part of our job is trying to make some sense of that for our audience. I think there's also a public service role to journalism: part of what we do is trying to expose and prevent wrongdoing or public harms — we only have to think about lead in drinking water for a story that's in the news right now. Linked to that, pretty clearly, is holding those in positions of power and influence to account. And yes, don't take ourselves too seriously: we can entertain. Indeed, if we don't entertain in some way, those lofty goals are probably going to get lost, because nobody is going to spend the time looking at our work.

Okay, the data scientists in the room will probably be looking at this and thinking, well, you can do a lot of this with data — and I absolutely agree. So I want to talk a little bit about data journalism, which has become a bit of a buzzword in recent years. And we see things like this.
This was a video project done as a Knight fellowship — the Knight Foundation supports a lot of projects in journalism — and it looked at how the rise of open data and of tools for data analysis and visualization had fueled a new form of journalism. I'm going to argue that it isn't that new, but anyway, that's from a few years ago, 2010 I think. Data journalism has been presented as a thing of the future — and if the inventor of the World Wide Web says it's true, it must be, right? He's quoted here in The Guardian, again a few years ago now, in 2010 I think.

Then there was a bit of a tipping point around the last presidential election — and now, of course, we're in the midst of a new one — where this guy, Nate Silver, did something that I don't think would surprise too many people in this room. He aggregated polling data, then ran statistical simulations, based on certain assumptions, from that polling data. And he — not only he, but he was the most visible — was able to pretty successfully predict the outcome of the Electoral College vote. Actually, I think he slightly underestimated it; the result was 332 to 206. (There's a minimal sketch of this kind of simulation just below.) But it was this that the established political journalists had real trouble getting their heads around: the fact that, in the same way you might predict whether it's going to rain today, you could run statistical simulations on your polling data and come up with a probability that one candidate was going to win. It just didn't fit with the established media narrative that this was a close horse race. Well, all elections are kind of close — they never go 70-30 in the popular vote — but Nate Silver was very, very confident that Barack Obama was going to win, and he also correctly called the results in 50 out of 50 states. I don't think anybody in this room is going to be too surprised that that's something you could do, but believe me, it caused some consternation among media pundits.

And now FiveThirtyEight, which was Nate Silver's operation, is a brand in its own right. It's no longer at the New York Times, where it used to be; it's owned by ESPN, the sports broadcaster. And the New York Times' response to that and other operations out there is The Upshot. These are, I guess, journalistic brands that have some sort of data analysis at the heart of what they do.

Okay, so there's definitely something that's changed, and the rise of open data and open-source tools is definitely part of that. But I want to talk a little bit now about the history of data journalism, and I want to draw your attention to the initials over on the right of this webpage: C-A-R. What does CAR mean? It stands for computer-assisted reporting, which nowadays seems a terribly quaint title for what we now call data journalism, but it tells you something about how long this has been around — a time when using computers in journalism was seen as an amazing thing to be doing. If you want to meet data journalists other than me — and you may not, after this, I don't know — the place to go is the CAR conference, which will happen in Denver, Colorado in March. I'll be speaking there, and lots of other people will as well.
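To make that concrete, here is a minimal sketch of poll aggregation plus Monte Carlo simulation. This is not Nate Silver's actual model; the state polling averages, uncertainties, and "safe" electoral votes below are made-up placeholders.

```python
# A toy poll-aggregation simulation: draw a plausible margin for each contested
# state from its (hypothetical) polling average, tally electoral votes, repeat.
import numpy as np

# state -> (electoral votes, mean Dem lead in points, polling uncertainty); all invented
states = {
    "OH": (18, 2.0, 3.0),
    "FL": (29, 0.5, 3.5),
    "VA": (13, 1.5, 3.0),
    "CO": (9, 1.0, 3.0),
}
SAFE_DEM_EV, SAFE_REP_EV = 237, 232   # hypothetical "safe" states; 538 total with the above

rng = np.random.default_rng(42)
n_sims = 10_000
dem_wins = 0
for _ in range(n_sims):
    ev = SAFE_DEM_EV
    for votes, lead, err in states.values():
        if rng.normal(lead, err) > 0:   # Dem carries this state in this simulation
            ev += votes
    dem_wins += ev >= 270

print(f"Democrat wins the Electoral College in {dem_wins / n_sims:.1%} of simulations")
```

A real forecast would weight polls by recency and pollster quality and model correlated errors between states; this only illustrates why the output is a probability rather than a call.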
At that conference you'll find everybody from people dipping their toes in the water, playing around with a spreadsheet for the first time, to people doing some of the stuff I'll show you later, which is starting to apply machine learning to problems in journalism. I also want to acknowledge the role of this organization, Investigative Reporters and Editors. They are fantastic at nurturing data journalism — really the main organization doing so. Their training workshops are great; they've certainly brought my skills on from being pretty much an idiot to, hopefully, being a little bit less of an idiot over the past few years. So, IRE: a fantastic organization. If you're interested in data journalism, I really recommend its conferences and workshops.

Now, going back to "not exactly rocket science," and to what I said about lowering expectations: a lot of what we do in data journalism really is not that sophisticated, and it doesn't actually need to be. This is a slide that all of my students — and there are, I think, at least two in the room — have seen before. When I talk to people who are working with data in journalism, particularly people who are starting out, I talk about interviewing data: asking questions of data. How do you interview data? These are the common operations, and they account for probably two-thirds to three-quarters of what I spend my time doing. It's rinse and repeat on these very basic operations with data, which I may be doing in a database, or with some code, or while making graphics — but basically that's what I'm doing a lot of the time, and a lot of the time it does not need to be much more sophisticated than that.

Now, as I said, this has been around for a while, so I want to dip back into history and talk a little bit about an individual who, within that world of what used to be called computer-assisted reporting, was seen as the father — perhaps the grandfather now — of the field: Philip Meyer. Philip Meyer was for many years a professor of journalism at the University of North Carolina at Chapel Hill. But what he's known for is that he was one of the first people to really start using quantitative methods in journalism: he came out of social science and brought that training to what he did at Knight Newspapers. He's also the author of a book I recommend to my students, first published in 1973 and now in its fourth edition, a really excellent book called Precision Journalism.

So what did he do? Among other things, he won a Pulitzer Prize for the newspaper he was working for at the time, the Detroit Free Press. What was the work? Some of you may know that in 1967 there was a very devastating riot in the city of Detroit; some of the statistics around that riot are over on the left. Arguably the city has never really recovered from those events: entire blocks were burned, the National Guard were on the streets — this was a very, very serious civil disturbance. But when we look back on it now, particularly in the light of recent events like Ferguson and the protests of the Black Lives Matter movement, the causes people argued may have triggered it are kind of interesting. Detroit, and its civic leaders, thought of itself as a progressive, go-ahead city.
And the explanations being discussed in the media at the time — well, here are a couple of the leading ones. One was that this was economic: it was about economic opportunity, and that was the reason for it. Another was that this was in some way not about Detroit but about the South: that these were African Americans who had long been oppressed in the Southern United States, had moved to Northern cities like Detroit, and were venting pent-up rage. This was seriously advanced as a theory for what was going on.

Philip Meyer did a very simple but rather profound thing. He recruited canvassers from the African American community, devised a carefully thought-through questionnaire, and did survey research. What he showed was that those who had rioted were just as likely to be college graduates as high school dropouts — which didn't really fit the purely economic explanation. He also showed that the whole thing about the South was a complete red herring: those who were born in the South were actually much less likely to have participated in the riot. And attention turned, from the results of his survey, to pervasive racial discrimination — particularly in policing, which will be no surprise to us now, since we're still dealing with these issues — but also in opportunities for housing in the city of Detroit. The reporting he was at the heart of won a Pulitzer Prize for the Detroit Free Press.

I'm now going to go through some common types of stories done by people who call themselves data journalists. Most of these, again, are deliberately fairly old stories, so you get a sense that data journalism has been around for a while.

This one is a fairly classic example of using a relational database and Structured Query Language to do a story that would otherwise be very difficult and time-consuming. It basically involves taking two sets of data that were never intended to be tables within a relational database, making them into tables, and running queries to join data across them. (There's a minimal sketch of that kind of join just after this passage.) This is from New York Newsday; I think it was published in 1999 originally. It uses disciplinary actions against doctors as one of its sources of data — those are handled at the state level, and they cover things like doctors who put patients at risk through poor patient care, sexual misconduct, or financial fraud: the type of people you wouldn't necessarily want giving you medical care. The reporters then looked at the doctors who were recommended to patients by the major health insurers in New York, and found more overlap between those two sets of data than you might think would be ideal. So that's a very common type of data story.

We all like maps — I particularly like maps — and geographic information systems and geographic data are used a lot in journalism. This is, I think, a fairly nice example; I also like it because it's an example of aircraft-assisted as well as computer-assisted reporting. The background is that in December 2007 there were some severe landslides which caused a lot of property destruction in the state of Washington. The Seattle Times was interested in this, and this story was published a few months later. The first thing the reporters did was get up in a plane and take some photos like this one.
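Before we go on, here is that minimal sketch of the relational join, using SQLite from Python. The table layouts, names, and records are hypothetical, not Newsday's actual data; in practice, matching on names and license numbers also requires a lot of cleaning.

```python
# Two datasets never designed to live in the same database, loaded as tables
# and joined with a query: which disciplined doctors also appear in an
# insurer's recommended-provider directory?
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE discipline (doctor_name TEXT, license_no TEXT, action TEXT, year INTEGER);
CREATE TABLE insurer_directory (doctor_name TEXT, license_no TEXT, insurer TEXT);
INSERT INTO discipline VALUES ('DOE, JANE', 'A12345', 'Probation: negligent care', 1997);
INSERT INTO insurer_directory VALUES ('DOE, JANE', 'A12345', 'Acme Health Plan');
""")

rows = con.execute("""
    SELECT d.doctor_name, d.action, d.year, i.insurer
    FROM discipline AS d
    JOIN insurer_directory AS i ON d.license_no = i.license_no
    ORDER BY d.year
""").fetchall()

for row in rows:
    print(row)
```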
The reporters could see that these landslides seemed to have originated on steep slopes that had been clear-felled, but they wanted to quantify that. How did they do it? Through public records requests they obtained GIS data on the areas that had been clear-cut, prior surveys of the slopes thought to present the biggest hazards, and the locations where the landslides had occurred. I won't go into the details, but the methodology was fairly simple, and it showed that yes, indeed, these landslides had disproportionately originated on those risky slopes that had been clear-felled. They also pulled the documents — the permits — and showed that very little consideration had been given to those risks when the permits were granted to the logging company. They really were rubber-stamped.

Okay, here's an example of some fairly simple statistics and graphical analysis being used for a story, but I think it's a nice one. It's from 2004 and it's about standardized educational testing. It started, as many things in journalism do, with an anecdote — probably somebody coming to the reporters and saying, there's something weird going on at this school; we think they're cheating on the tests. They reported on that and did a story, but then they pulled all of the testing data for the entire state. At its simplest, the analysis is really straightforward: it's a regression analysis, shown on this scatter plot, and I'm just going to talk about this one. So what have we got here? We've got an aggregate measure of testing scores for each school — there are 3,000 or so schools across the state. The 2003 third-grade reading scores are on the x-axis, and on the y-axis we have the same kids a year later, the 2004 fourth-grade scores. Of course you would expect a strong correlation there, and what we see on the scatter plot, with the linear regression line, is absolutely what you'd expect to see. They were interested in the outliers — the weird ones, like this one, Harrell Budd Elementary School. Why had those scores jumped up so much? It would be great if the answer were a fantastic teacher. It wasn't a fantastic teacher; it was organized, teacher-led cheating on the tests. That's the most extreme one, but they also looked at other outliers where they found examples of this happening.

Okay, now I want to bring it back to the present, because I've talked a lot about the past. This is a story from some of my colleagues at BuzzFeed News. John Templon works on the data team there. (I'm actually a science reporter — I work with data a lot, but I'm not on the core data journalism team at BuzzFeed News; John is.) I want to be very careful about what I say about this story: it is not a story that proves match fixing in tennis by any individual, and I think we do need to be really careful in data journalism about the limits of what we can say. But what John's analysis does — and I'll show it to you in a minute — is provide evidence of patterns that are highly suspicious of match fixing. And it was supported by a whole lot of documentary material as well: the other reporter, Heidi Blake, had obtained a whole series of investigations which had been handed over to the tennis authorities but had apparently not been acted on, and they had all of those documents too. So I really like the fact that I can show you this.
Showing your work is a growing trend, not just in science and data science but in data journalism as well: showing, where possible, the data — I'm not sure that was possible in this case, because they took great care not to identify individuals — and showing your methodology, your analysis, and your code. So this is on GitHub; I can probably switch to the web and we can take a little look at what John did. He starts with betting data — he has a lot of betting data. Then he focuses down on the matches where the odds shifted a lot between when the book opened and when the match took place. Then he focuses down even more on the players who end up losing a lot of those matches, apparently contrary to what form would suggest. They have various criteria, and he ends up with 39 players. And then, simulations again: he runs a whole bunch of statistical simulations of how you would expect those matches to pan out, and narrows it down further to the players who show what look to be very suspicious patterns — patterns you would not expect to occur by chance. And I'm sure we have many people in this room who will know what this is: an IPython notebook which gives all of the code for his analysis. So, show your work — well done, John. Anybody who wants to do anything similar can go and repeat it.

So let me get back to my current slide — I've already shown you that. So far I've talked about what other journalists have done, mostly because I find their work pretty impressive. You may be wondering, well, who is this guy? What does he know? So I thought I should show that I do some work with data myself. I'm not claiming it's as interesting or as good as some of the work I've shown you already, but I just want to give you a flavor of the sort of stuff I do, and I'll start with something really simple, to again emphasize the point that a lot of what we do with data in journalism doesn't need to be that sophisticated. This is one I published about a week ago — but first, I forgot there was another slide: the tools I commonly use. You'll recognize a lot of these. I do mapping work. Open Refine, if you're not familiar with it, is a very useful tool for data cleaning; it has some nice clustering algorithms for reconciling messy text fields. The one I try to use as much as I possibly can, and which is kind of my comfort zone, is R and RStudio. And the last one is clearly not an analysis tool, as you'll know if you're familiar with it: I'm an old print journalist, but I'm now an exclusively digital journalist — none of my work ends up on paper — and a lot of my work gets put into graphical form, so this is just a nod to the fact that that typically involves a lot of JavaScript. My colleague Adam, who is in the room, will know how much I dislike JavaScript.

Anyway, here's a very simple story, told through an animated GIF; there's a whole series of these in the article. I don't need to tell you what's going on here — it's pretty obvious. The origin of the story was a tweet: somebody said, oh, somebody should do that, and we thought, yeah, that wouldn't be that hard. The National Oceanic and Atmospheric Administration has already provided a bunch of geodata showing expected inundation at differing levels of sea level rise, and we can geocode Trump's properties pretty easily. That's not hard to do.
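Here is a minimal sketch of the overlay behind those GIFs: geocoded property locations tested against sea-level-rise inundation polygons. The file name, coordinates, and labels are placeholders, not the data we actually used.

```python
# Ask whether each geocoded point falls inside any inundation polygon from a
# (hypothetical) GeoJSON export of NOAA's sea level rise layers.
import json
from shapely.geometry import shape, Point

with open("noaa_inundation_6ft.geojson") as f:          # placeholder file name
    zones = [shape(feat["geometry"]) for feat in json.load(f)["features"]]

properties = {
    "Example Tower": Point(-80.13, 25.79),               # (lon, lat), placeholder
    "Example Resort": Point(-80.12, 25.85),
}

for name, point in properties.items():
    flooded = any(zone.contains(point) for zone in zones)
    print(f"{name}: {'inundated' if flooded else 'dry'} in this scenario")
```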
The property shown there is actually a Trump-branded one rather than a Trump-owned one, but we thought they were all fair game for this, and we did explain the difference in the story. There was a story that went along with the GIFs, but really you almost don't need it. I talked earlier about entertaining, and people were certainly amused by this story. But for me there is a serious point to it, which comes back to trying to make sense of this complex world: I'm using it as a device, in part, to talk about a very serious problem facing coastal cities, and to talk a little bit about what Miami Beach is actually doing, because Miami Beach is doing some interesting stuff. (This image isn't Florida, by the way — that's Waikiki Beach in Hawaii, which faces similar problems to much of southern Florida.)

Okay, more climate — I do write about things other than climate, but I threw this in as well as an example of something that doesn't have a story as such; I think the graphic is the story. This was a project I worked on, essentially a visualization of NASA's historical surface temperature record, which certain people don't believe is true. What it allows you to do is drop a pin anywhere on the planet, and the graph on the right will adjust to show you the historical temperature record where you live, or at any location you care about. (There's a toy sketch of that nearest-location lookup below.) It's based on the idea in journalism that all news is local: you've got to make these abstract, big stories relevant to the individual, and this is an attempt to do that. Just as an aside: I did this with a journalist and programmer called Chris Amico a few years ago — I think we first published it in 2012 and updated it subsequently — but I would not do it now. And the reason is this: these big set-piece, map-based visualizations do not work well on mobile phones. So now you see me doing stuff like animated GIFs instead. It's an interesting example of technology in some ways forcing us to be less sophisticated so that we can communicate more effectively.

I threw this one in for two reasons. One, because I mentioned my genome and you might be intrigued by that. But also because it lets me make another point that we might discuss later on, which is that as journalists we are not limited to working with data that's been put into the public domain by the government, or by scientists, or by whoever else. We can generate our own data from the real world. So here's my prop. What this story asked was: could somebody take this — or, more accurately, a glass or something less permeable than this — and, without being an expert geneticist, with nothing more sophisticated than a telephone and a credit card, use commercial services to get my DNA extracted and analyzed, and find out quite a bit about the health risks I may face as a result of the accident of my genome? The answer was that it was surprisingly easy to do, and I did a whole bunch of other stories looking at that. The bigger point here is collecting your own data: I'm really excited about the opportunities for using automated sensors in journalism, and I hope to be doing some projects in this area in the coming months with something called the BuzzFeed Open Lab, which Stefan has already visited. So I just flag that up as a thing that might happen in the future.
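And here is the toy sketch of that "drop a pin" lookup: find the temperature series closest to a chosen location. The tiny records dictionary stands in for a real gridded dataset such as NASA's surface temperature record; the numbers are invented.

```python
# Nearest-location lookup using the haversine (great-circle) distance.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# (lat, lon) -> list of (year, temperature anomaly in deg C); invented values
records = {
    (37.8, -122.3): [(1980, 0.1), (2000, 0.4), (2015, 0.9)],
    (51.5, -0.1):   [(1980, 0.2), (2000, 0.5), (2015, 1.0)],
}

def series_for_pin(lat, lon):
    """Return the grid point nearest the dropped pin, and its series."""
    nearest = min(records, key=lambda k: haversine_km(lat, lon, k[0], k[1]))
    return nearest, records[nearest]

print(series_for_pin(37.77, -122.42))   # a pin dropped on San Francisco
```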
And this is probably a more typical data story for me. Apologies if you don't come from a science background, because it's a very insidery, geeky science data story. I wrote it for Nature in 2014 — I used to work for Nature, and I was freelancing at this time. It looks at the journal Proceedings of the National Academy of Sciences, better known as PNAS. One of the interesting things about PNAS, if you're a scientist, is that not only does it peer review papers in the normal way, but if you are an eminent member of the academy, you can essentially organize your own peer review. These are called contributed papers: you get to choose the reviewers, and you get to choose how you respond to them. There are some other levels of scrutiny as well, but basically it's self-organized peer review. So I pitched this to the editors at Nature. I said, look, I've never seen anybody examine who's using this route most often, and I think your readers are going to be absolutely fascinated by that; they'll want to read it. That was the minimum story I pitched. But I also suggested there was something else I could do: if we could get citation data — and I was pretty sure I could — then we would be able to look at the impact of those papers in terms of their citation by other researchers.

So that's what I did, and this is the basic story. It turned out that the people we called the power users included quite a lot of members of the journal's editorial board, which was kind of interesting. Maybe they put in the work and deserved a freer ride on some of their papers; you can argue it either way. And this is a bit of statistical analysis I did on the citations. It's basically a multiple regression. The model needs to include a few things besides the type of paper — contributed versus edited in the normal way (communicated is another type, which doesn't exist anymore). I needed to account for some other variables as well: how long the paper had been published, which affects how many citations it gets; whether it was open access or not, since PNAS has an open-access option and open-access papers tend to be cited more often; and, crucially, the scientific discipline, which luckily was in the metadata of the papers. So I wrote some Python scripts to scrape ten years of papers down from the web, and then extracted what I needed from the HTML and the metadata.

But I can start here to talk about some of the problems and challenges we face, which is where I want to go for the rest of my time. The big problem is that the first part of this story depends crucially on each member of the academy being correctly identified, not confused with anybody else, and having all of their papers attached to that one name. It turns out that's a huge problem, and it's what took all the time with this story: people include their initials or not, or the accent on their name goes on the paper or it doesn't. I used Open Refine for that, but it still took me, I think, more than an entire week of work to get it sorted out. These are the kinds of problems that make me wonder whether there are algorithmic ways to do this, to save me. It is certainly not the most enjoyable use of my time, I can tell you.
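One algorithmic starting point for that kind of name problem is the sort of key-collision clustering Open Refine uses. Here is a minimal sketch with invented name variants; real disambiguation also has to handle initials, affiliations, and genuinely distinct people who share a name.

```python
# "Fingerprint" clustering: strip accents and punctuation, lowercase, sort the
# tokens, and group name strings that share the resulting key.
import re
import unicodedata
from collections import defaultdict

def fingerprint(name):
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = re.findall(r"[a-z]+", ascii_name.lower())
    return " ".join(sorted(set(tokens)))

names = ["Jane Q. Member", "Member, Jane Q", "JANE Q MEMBER", "Jané Q. Member", "J. Member"]

clusters = defaultdict(list)
for n in names:
    clusters[fingerprint(n)].append(n)

for key, variants in clusters.items():
    print(key, "->", variants)
# Note that "J. Member" lands in its own cluster: initials need extra handling.
```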
But I felt I had to do it, because otherwise the story didn't exist — and it would be great to have those sorts of problems automated. Here's another story, though, and I think it shows a much worse version of the problem. It's very similar to that doctors story I showed you already, in that I was again using state medical board disciplinary data, but here I was interested in — well, this question is what came out of it: why indeed are drug-addicted, disgraced doctors running clinical trials of new experimental drugs? And I found dozens of them. The Food and Drug Administration has a database of clinical researchers, and I was joining that to disciplinary data from state medical boards; the FDA also has some disciplinary actions of its own that I was able to look at.

But now we get into the problems: dirty, incomplete data. This is that FDA database, just pulled into a spreadsheet, with the enigmatic field names spelled out so we can see what we're looking at. You'll see — just look at the last-name field — that there are clearly some issues with it. If I sorted this date column, you'd see dates that haven't happened yet, and dates that precede the Norman conquest of the country of my birth. So there are clearly some problems with the data. But that's not the biggest problem. If you go back and read the story, you'll find that, of course, I tell it through the stories of individual doctors and what happened to them. The doctor who leads the story was an emergency room doctor in California who had a number of issues, but the most serious was that he was responsible for a failure of diagnosis and treatment that caused an unnecessary amputation. He was disciplined for that and put on probation. But then, subsequently, he was hired to run a very high-stakes, high-risk clinical trial involving cancer immunotherapy and terminally ill patients. That would not have been my choice of the best person to run that trial. And it turns out his site got inspected by the FDA and the trial got shut down — it's actually very rare for that to happen — because the FDA was not happy with the way it was being run. But he's not in here. So how did I find him? Well, I found him in the FDA disciplinary data. But why isn't he in this database? This is what we have to wrestle with as data journalists all the time.

There is a government form — I can't remember the number; it might be something like Form 1572 — that you fill in with this information if you're a drug company and you're intending to hire a doctor to run a clinical trial. (By the way, if anybody doesn't know about clinical research: it primarily does not happen in big academic medical centers; it happens in doctors' offices around the country.) If you fill in the form, the information gets put into this database — it may be entered with some problems, but it gets in there. But you don't have to fill in the form. The drug company can send in a CV instead, and if they send in a CV, the person doesn't end up in the database. Like that guy. Now here's the other problem. What I really want to know as a reporter is: what drug? What company? What clinical trial? There is no information on that in here whatsoever. We just know that Dot Wends, whoever Dot Wends is, was hired for a bunch of clinical trials over the years.
The FDA views that as commercially sensitive information which is not for the public domain; I can't get it through the Freedom of Information Act. Now, there is another database, clinicaltrials.gov. And this is a clinical trial involving one of the other doctors I wrote about, who had a long history — it's actually a pretty sad story — of repeated brushes with the California State Medical Board, culminating eventually in him losing his license after he continued prescribing while suspended. He was also addicted to opioids himself, and here he is involved in recruiting and managing patients in a trial of an opioid drug. I don't think that's the best idea; I don't think he was the best choice. Having spoken to the guy, I see it more as a sad case than a case of a bad man. But I was only able to find that because the drug company happened to have listed his clinical trial site, which had his name in it, in the information it submitted to clinicaltrials.gov. Mostly — or at least often — they don't, so you don't get that. So I was piecing this information together, and data journalism often feels like this. I would have loved to put firm numbers on that story and ask: are disciplined doctors more likely to be hired? I don't think they should be hired at all, most likely, but are they more likely to be hired? I can't answer that, because I'm dealing with disconnected, incomplete data, and it often feels like what we're doing is trying to understand and tell a story by looking through a keyhole, or by shining a light into a darkened room. These are the sorts of problems we face. Now, I don't know if data science can help when the data is just wrong, incomplete, missing, or siloed, but if you have any ideas about how to help us deal with these problems, I'm all ears.

Another challenge: lots of the data in journalism is not neatly structured. At least the examples so far involve tables with fields you can work with fairly easily, but a lot of it is stuff like this — somebody I've already spoken to today deals with text analysis, I know. It might be Hillary Clinton's emails; it might be documents relating to a whole series of legal cases; it might be, as here, the military's reports on the progress of the wars in Iraq and Afghanistan, put into the public domain by WikiLeaks. And here I think we're starting to see some examples of data science helping reporters with the problems they have. This is some work by a journalist called Jonathan Stray, who at the time was working with the Associated Press. What has he done here? Text mining. There are these significant action, or SIGACT, reports — I think more than 10,000 of them in total here, from the bloodiest month of the Iraq insurgency, December 2006. He has used something called TF-IDF, the measure that basically gives you the words that are characteristic of a document relative to their frequency in the overall collection of documents. So he has those terms for each report, he can then use clustering, and he has used network analysis to draw the clusters together in a force-directed layout — and we start to see things emerging. Now, the colors are not an output of the clustering; the colors come from the classifications that the military analysts wrote on the reports. Dark blue are criminal events — there is a particularly nasty collection of criminal events about abductions and murders here. The red is explosive hazards, and the most characteristic terms are "truck" and "tanker."
That again tells us something about what was going on: there were fuel shortages in Iraq, there were trucks of kerosene on the streets that people were lining up at with cans to fill so they could cook their food, and those trucks were being blown up and people were being killed. Nobody is going to have the time, the energy, and the concentration span to read those hundreds of thousands of documents in their entirety, but this kind of analysis can give us clues as to where we need to look for the stories. And the great news is that this is now available as a tool for everybody to use; it's called Overview, and you can go and look at it on the web. It does text analysis, it does clustering, and then it drops the clusters into folders so you can go and look at the documents and see what's interesting.

I don't have time to show you this video, I'm afraid — and I don't think the audio would work over the streaming — but I highly recommend it, and if you're interested I can show it to you later. It's from that CAR conference a couple of years ago. The journalist talking is Chase Davis, one of the few journalists using machine learning in journalism now. He used to work at the Center for Investigative Reporting here in the Bay Area; he now works for the New York Times, which tends to suck up a lot of the best journalists. His talk, so you can search for it, is called "Five Algorithms in Five Minutes" — you'll find it on Vimeo — and he talks about some algorithms that the data scientists in the room will, I'm sure, be very familiar with, and how he thinks they can be applied to journalistic problems.

So I'll move on and talk about some stories where we're starting to see this approach being used in journalism. This one is Chase's, from when he was at the Center for Investigative Reporting, and it's actually very similar to the Overview approach: again, he's using text mining to get the words characteristic of documents. In this case the documents are bills before the California legislature. What Chase was interested in is that bills come around again and again — most of the time they don't get passed, or even if they do, they may not get signed by the governor — but they come around with different titles and usually with different sponsors, so it's not a trivial task to identify the ones we've seen before. Chase used text mining for that, and then did pairwise comparisons using cosine similarity — some people in the room will understand that better than I do — to find the bills that seemed most likely to be duplicates. (There's a minimal sketch of this idea just below.) And the story he produced basically identifies the bills that Schwarzenegger was repeatedly vetoing but which, when Jerry Brown came in as governor, suddenly got passed into law and signed.

Now, these next couple of examples I think are particularly interesting, and they show where we may be able to go. I'd also love, at this point, to be presenting my own machine learning story. I am actually working on one, but unfortunately it's a little bit sensitive and I can't really talk about it — it's looking very promising, though — and it's conceptually a bit similar to this story from the New York Times. The reporter here, Hiroko Tabuchi, is a Pulitzer Prize winner who works out of Tokyo for the New York Times.
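Here is that minimal sketch of the TF-IDF and cosine-similarity idea, assuming scikit-learn is available. The bill texts are toy placeholders, and real legislative text would need much more care (boilerplate removal, stemming, and so on).

```python
# Represent each bill as a TF-IDF vector, then flag pairs whose cosine
# similarity exceeds an arbitrary threshold as possible duplicates.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bills = [
    "An act to regulate the sale of widgets and to require annual reports on widget sales.",
    "An act to require annual reports on widget sales and to regulate the sale of widgets.",
    "An act to fund road maintenance in rural counties.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(bills)
sims = cosine_similarity(vectors)

THRESHOLD = 0.5
for i in range(len(bills)):
    for j in range(i + 1, len(bills)):
        if sims[i, j] > THRESHOLD:
            print(f"Bills {i} and {j} look like near-duplicates "
                  f"(cosine similarity {sims[i, j]:.2f})")
```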
Tabuchi was interested in airbags manufactured by a company called Takata. She had gone through thousands — I think more than 30,000 — incident reports filed with the National Highway Traffic Safety Administration, and she had flagged a couple of thousand of them as being suspicious of a problem with a Takata airbag (you get an idea of how hard some journalists actually work). She then teamed up with a data scientist at the New York Times, Daeil Kim. What he did was use a logistic regression model, essentially training it on the reports she had flagged, to ask: do any others come up that she might have missed? In the end it's a fairly small part of a story which has led, I believe, to the resignation of senior people and has certainly led to recalls: she had identified several dozen serious injuries that seemed to be caused by these airbags, and the machine learning added, I think, seven more. So it's a small contribution to the story, but I think it definitely shows where we might be able to go.

And here's another example, very recent — from October last year — from the Los Angeles Times. It's actually a follow-up to a previous investigation they did, but it's something that definitely wouldn't have been possible without machine learning, just for time and resources. The first two reporters here, Ben Poston and Joel Rubin, had done a story a year earlier in which they essentially went through one year of crime reports filed by the Los Angeles Police Department, and found that there seemed to be a problem with misclassification of serious assaults as less serious forms of violent crime. If you know anything about the way policing works, it's very stats-driven nowadays, so the numbers matter — and they were wrong. They did that for one year, and I can only imagine the state of mind they were in after reading all of those documents. This is where Anthony Pesce, the data reporter, comes in. He trained an algorithm — a couple of algorithms, actually — to do a similar thing: we know the ones you've identified from this one year, so let's use that to train the algorithms, then go across eight years of data and see what happens. Has this been a systematic problem? Has it grown worse over time? Is it recent? They found that it had been going on at a similar rate through the entire period. They did go back and do some sampling and manual inspection of the data to check whether the algorithm was performing correctly. But this is a story where, having done the original investigation, I'm sure they would otherwise have walked away rather than saying: actually, now we can expand this and let the algorithms do the work for us. And again, showing your work: there's an IPython notebook, so anybody can analyze the LAPD's data now. (A minimal sketch of this kind of classifier follows below.)

What I want to finish with, if I have just a few more minutes, is some reporting I did a few years ago which has left me deeply unsatisfied — I feel it's unfinished business. I worked on it with a very sharp-eyed and obsessive — I mean that in the nicest possible way — reporter called Eugenie Samuel Reich, without whose intense dedication to the task it really wouldn't have been possible.
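Here is that sketch of the general approach behind both of those stories: training a classifier on records a reporter has already labeled, then using it to surface similar ones. This is not the Times' or the LA Times' actual code; the tiny training set is a placeholder, and any real use would need far more data and careful validation.

```python
# Fit a logistic regression on TF-IDF features of labeled reports, then rank
# unread reports by the model's probability that they are "suspicious".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_reports = [
    ("airbag inflator ruptured and metal fragments struck the driver", 1),
    ("inflator exploded during a low speed crash injuring the occupant", 1),
    ("radio display intermittently blank, no injury reported", 0),
    ("paint peeling on hood after repeated car washes", 0),
]
texts, labels = zip(*labeled_reports)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

unread = [
    "driver injured when the airbag inflator ruptured in a minor collision",
    "seat fabric discolored after sun exposure",
]
scores = model.predict_proba(vectorizer.transform(unread))[:, 1]
for report, score in sorted(zip(unread, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {report}")
```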
I'm interested in stem cell research — I used to cover it quite a lot — and I knew of some high-profile papers that people could not replicate. Scientific misconduct was in the air at the time; it was shortly after the incident in Korea, which, if you don't know about it, I can tell you about later. So we went looking at these groups' work, and we found some problems with the work of one group at the University of Minnesota. Basically this led to a couple of papers being retracted — actually, it turned out the problem papers came from two groups at the same institution — and there have been, it may be into double figures now, papers that were corrected. As we'll see, there are some issues around whether, if one person is found guilty of misconduct, it really is only one person; I think that's a very open question. And it leaves me concerned about how widespread a problem this might be: I don't think the University of Minnesota is likely to be unusually bad, and that's exactly my concern.

We've talked about how text is comparatively easy, and we've seen automated tools for text analysis making a difference in cleaning up the scientific literature. Plagiarism is a problem in science, and a problem people want to reduce; plagiarism detection is not that computationally difficult, and there is software to do it. There's a database — the delightfully named Déjà vu database — of suspiciously similar papers, and this little graph suggests that the existence of these tools may be making a difference. I don't think they're used as widely as they might be, but there's certainly potential there for data to clean up the scientific literature. Images, though, I think are much harder.

So I'll talk a little bit about the sorts of things we found. This was a couple of years into our reporting: finally there was a finding of misconduct, and a paper was retracted, in the journal Blood. You'll see the description there of the sort of problems; I'm just going to call it inappropriate image manipulation — I think the word misconduct can be fairly loaded — but what we found going on is certainly not good scientific practice. So what did we find? A lot of data in biology now looks like this: images of electrophoretic gels showing, in this case, the presence of particular proteins. This is a panel of images from the Blood paper, and this is a horrible PDF of some images from the same research group — from one of their patents, as it happens. If you look at them, that is essentially the same image as that, yet they're describing different experiments and different proteins. And the explanation we got for a lot of this was: we just mixed them up, we got the wrong images in the wrong places, it was a simple accident. Where the stories we were doing started to get more traction was with the one I've ringed in pink. The issue there, if you look at it, is that the two images are horizontal flips of one another, and this one has an extra lane on the left which later forensic analysis showed had been spliced on. That was really what kicked all of this off. But then, subsequently — and this is from the same research group, though the researcher who was found guilty of misconduct in that prior case was not an author on this paper; it's in PNAS, which is coincidental, and I don't have a particular problem with PNAS — if we take this image and, this time, not flip it horizontally but rotate it through 180 degrees, we get that image.
If you then mess around with the aspect ratio a little bit, you end up with that image, which it turns out can be completely superimposed on the other. They're the same thing. So that's the type of thing we can get in the scientific literature. That hasn't been corrected, by the way: that paper still stands, and that figure still stands. And while we were doing this, we stumbled across another lab at the same institution that had far more problems with inappropriate image use — we stumbled upon it only because the principal investigator at that lab also collaborated with this group. This is the sort of thing we were seeing: lots of internal duplications, lots of weird-looking splicing of images together. Six papers from this group had to be corrected subsequently. Again, not a finding of misconduct — I can talk about why not, if you're interested — but clearly problems. By the way, you can see we're starting to give ourselves a helping hand and make these things easier to see: in Photoshop we're using a little action that exaggerates the grayscale and then imposes a false-color gradient over it. But still, in the end, we're doing this manually, with our own eyes, and that's the problem.

At this point it started to feel crazy: we had only gone into one institution, we had found problems with a second research group, and the second research group was worse than the first. How widespread is this? So at that point I proposed to my editors: look, I could do a big survey, and we could go through all these papers systematically. And they said: no way — you'll spend the rest of your life doing this, it'll be all you do, and that's not what we pay you for. Quite correctly, I think. But it has left me with the feeling that there is unfinished business here, and that maybe some combination of image search and clever algorithms could — well, not solve this problem, but start to address in an automated way the quality of the image data reported in the scientific literature (a toy sketch of the idea follows at the end). I think that would be great. And just in case you're not interested in cell biology and don't care about that, I do think images are a challenge more generally. I'll just flash this up — I don't want to dwell on it, because it's so unpleasant even to think about — but this is a story I wrote a few years ago about Google and other technology companies trying to apply image search to the forensic investigation of child pornography and to bring those criminals to justice. So if we start thinking creatively about what images are used for, and whether data science can in some way be applied journalistically and forensically to them, I think there are some interesting things to come.

I'm going to stop talking now — I think I've filled my allotted time. I'm happy to chat about anything I've raised, and I apologize if my non-expertise has led me to say some things you're wishing to correct me on. Please do feel free to correct me.
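As a footnote on that image problem, here is the toy sketch of one piece of the automated approach I'd like to see: a crude perceptual hash of one image compared against another image and its flipped and rotated versions. Real forensic analysis of gel images needs far more robust methods; this only illustrates the idea, and the file names are placeholders.

```python
# A crude "average hash": downscale to an 8x8 grayscale grid, threshold each
# cell at the mean, and compare hashes by Hamming distance across transforms.
from PIL import Image, ImageOps

def average_hash(img, size=8):
    small = img.convert("L").resize((size, size))
    pixels = list(small.getdata())
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def looks_like_duplicate(path_a, path_b, max_distance=5):
    """True if image B resembles image A, A mirrored, or A rotated 180 degrees."""
    a = Image.open(path_a)
    b_hash = average_hash(Image.open(path_b))
    candidates = [a, ImageOps.mirror(a), a.rotate(180)]
    return any(hamming(average_hash(c), b_hash) <= max_distance for c in candidates)

# Usage (placeholder file names):
# print(looks_like_duplicate("figure_2b.png", "figure_5a.png"))
```

Because the hash downscales everything to the same small grid, modest aspect-ratio changes are partly absorbed as well, though genuinely robust duplicate detection would need better features than this.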