Introducing the next talk, getting at the lurking stories behind the numbers: Mr. Stefan Wehrmeyer will talk to us about "Computing Numbers, with an Application to the Problems of Our Society". Please give him a warm applause.

Yeah, thank you. Obviously, the title is a reference to a Turing paper; I got many questions about it, but thumbs up to those of you who got it. This talk will basically combine computer science and journalism. I'm currently a data journalist; I joined a newsroom about one and a half years ago. It's called Correctiv, a nonprofit newsroom based in Berlin. We do long-term investigations, we are member-based, and currently a bit foundation-funded.

We are doing investigative journalism. What is that? One good example from popular culture would be this one: the movie "Spotlight", which just came out in the US and will come out soon in Germany. It's the story of child abuse by Catholic priests in the Boston metropolitan area, and this is the team that basically uncovered it. Or rather, these are the actors who play the investigative journalists. The whole film is actually quite a good representation of how investigative journalists work; it depicts a story from around 2001. It's slightly over-dramatized, of course, because it's a Hollywood film, but it portrays investigative journalism quite accurately.

It also depicts the gender balance quite accurately, as you can see here, but that is getting better. In Germany, for example, the leading data teams at Der Spiegel, Bayerischer Rundfunk, and SRF in Switzerland are led by women, and the organization that represents investigative journalists in Germany has mostly women on its board. Still, too many investigative journalists are men.
So women, please get into the field as well. Now, the Spotlight team: what they did is, they got a tip, and then they looked at data. They collected data about priests who were moving between the different parishes, the different districts in the metropolitan area of Boston. Every time there was an abuse scandal, the priest got a sick leave or something similar and was moved to a different district, to cover up the scandal and to make it appear to be just a single case. The truth, which they discovered, was that many more cases were present; they basically uncovered that it was a systemic problem.

This is actually one of the core pieces of investigative journalism: you don't show that a single thing is wrong, that this one man did something bad, but that the whole system is set up in a way that many people do many bad things. So what they used were books that listed which priest was where in which year, one book for every year. They went through them and typed the contents into a computer, and in the end they had a nice spreadsheet which displayed where priests were moving.
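As a toy illustration of what that spreadsheet made possible, here is a minimal sketch in Python. The records and the threshold are invented for illustration; the real data set was far larger and messier.

```python
from collections import defaultdict

# Invented records in the shape of the annual directories:
# (priest, year, parish).
records = [
    ("Fr. A", 1994, "St. Mary"), ("Fr. A", 1995, "St. Mary"),
    ("Fr. B", 1994, "St. Mary"), ("Fr. B", 1995, "Sacred Heart"),
    ("Fr. B", 1996, "St. Jude"), ("Fr. B", 1997, "Holy Cross"),
]

def frequent_movers(records, min_parishes=3):
    """Return priests listed in at least min_parishes distinct parishes,
    a crude signal for the reassignment pattern described above."""
    parishes = defaultdict(set)
    for priest, _year, parish in records:
        parishes[priest].add(parish)
    return sorted(p for p, seen in parishes.items() if len(seen) >= min_parishes)

print(frequent_movers(records))  # ['Fr. B']
```

Once the directories are in tabular form, the suspicious pattern falls out of a few lines of code; the hard part was the years of typing and the reporting around it.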
So investigative journalists and computers are a perfect match. Of course, computers are used in many other areas of journalism: every major newspaper also has a website, and there's now robot journalism coming up, where sports events are covered by computer programs, not by humans anymore. But what I'm specifically talking about here is investigative journalism, and a field that started as "precision journalism" or "database journalism", then "computer-assisted reporting"; "data-driven journalism" is the current term, and there's also "computational journalism". All these terms basically mean that you use a computer to do an investigative story.

Philip Meyer, one of the first investigative journalists to use a computer, said a journalist has to be a database manager. We can't quite compare a database admin with a journalist, but it's getting closer. A journalist has to have their facts, and of course there are too many facts to keep just in one's mind, so you have to put them in a computer. Now I will basically present a couple of fields in computer science that investigative journalists use to make their stories happen.

One of the big ones is, of course, natural language processing. You know the Snowden leaks, or you might remember the Offshore Leaks and the couple of leaks that followed. When you have a big leak of data, or you got a big set of documents, perhaps via freedom of information requests, these are thousands, maybe hundreds of thousands, maybe even more documents that you get either in paper form or as PDFs. What do you do with them? You can't possibly read them all, and current newsrooms don't have enough staff or enough time to spend on these investigations, so they have to use computers to make this job a bit easier. Natural language processing is perfect for that.
You just put all the documents you have into the computer, possibly after OCRing them, and then a couple of things might work in your favor. There's entity extraction, which finds out which entities these documents contain, so it's not only "Mr. Obama" but also "President Obama" and "Barack Obama". You can extract these entities and know which documents talk about which entities; in an email dump, for example, you can extract who's talking with whom. Company names are also easily extracted with entity extraction techniques, combined with deduplication.

There's topic modeling, so you know which documents talk about which topics. You don't have to read them all: if you want to focus your story on a specific topic, you go down that path and only look at the documents that were automatically categorized accordingly. Part-of-speech tagging is often quite useful too, for example when you look at documents like debates and want to find out who is talking about what, and in what kind of way.

And of course there's basic search. Search is always quite useful; there are many advanced ways to search, and journalists have to use them to make sense of these big document stacks. Document search has been a big part of computer science since the 70s and 80s. Nowadays we have Solr or Elasticsearch or other search engines that handle it quite easily, but these are made for computer programmers, right?
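Before moving on: to make the alias problem from above concrete ("Mr. Obama", "President Obama", and "Barack Obama" are one entity), here is a deliberately naive sketch in Python. Real entity extraction uses trained NER models; the title list and the surname-matching rule here are my own simplifications, not from the talk.

```python
import re

# Honorifics to strip; a real system would handle far more variation.
TITLES = {"mr", "mrs", "ms", "dr", "president", "senator"}

def normalize(mention):
    """Lowercase a mention, drop punctuation and titles."""
    tokens = re.findall(r"[a-z]+", mention.lower())
    return [t for t in tokens if t not in TITLES]

def group_mentions(mentions):
    """Group mentions by their last remaining token (a crude surname
    match), so variants of one name land in the same bucket."""
    groups = {}
    for m in mentions:
        key = normalize(m)[-1]
        groups.setdefault(key, []).append(m)
    return groups

mentions = ["Mr. Obama", "President Obama", "Barack Obama", "Angela Merkel"]
print(group_mentions(mentions))
```

Even this toy version shows the payoff: once variants collapse to one key, you can count which documents mention which person, which is exactly what the cross-document "who talks to whom" analysis needs.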
We as developers use these search engines, set them up, configure them, and build our own back end and front end on top, so that other people can actually use the search behind them. Journalists want a couple more features there, and we have a couple of applications that help us. Namely DocumentCloud, a service where you can upload lots of documents; they are automatically made searchable, entities are extracted, and you can also publish them for your readers to look at. There's Overview, which does topic modeling, so you can dive into your documents more easily. And there's Project Blacklight, which is basically a Solr front end that I can give to my journalist colleagues so they can use a basic Solr search in an easy way.

Then there's of course Google Refine, which is usually used for tabular data, but it also has a very good reconciliation back end, as well as clustering, with which you can do deduplication. So if you have a list of company names and they are very dirty, you can reconcile them, basically deduplicate them, and make all the company names match again.

There's also professional software, namely Nuix or IBM Watson Analytics. They are very expensive, most journalists have never seen them and possibly can't use them; it's very difficult to get your hands on them. That is quite sad, because journalists have to rely on these open source tools, and there are only a few that are actually made for investigations. As for the computer science part: I've mostly talked about English language models so far.
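A quick aside on the clustering feature mentioned above: the simplest method in Google Refine / OpenRefine is fingerprint key collision, where names that normalize to the same key are treated as duplicates. A minimal reimplementation of that idea (the company names are invented):

```python
import re
from collections import defaultdict

def fingerprint(name):
    """Fingerprint in the style of OpenRefine's key-collision clustering:
    lowercase, strip punctuation, sort and deduplicate the tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(set(tokens)))

def cluster(names):
    """Group dirty name variants that share a fingerprint."""
    buckets = defaultdict(list)
    for name in names:
        buckets[fingerprint(name)].append(name)
    # Only buckets with more than one spelling are interesting.
    return {k: v for k, v in buckets.items() if len(v) > 1}

companies = ["ACME Corp.", "Acme Corp", "acme corp.", "Example GmbH"]
print(cluster(companies))
```

The real tool adds more robust methods (n-gram fingerprints, phonetic keys, nearest-neighbor distances), but the key-collision idea above is the core of what journalists use to clean dirty company lists.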
It's very difficult to find good German models that are already integrated into some kind of software, so that you can use them in the German-speaking world; I hope that changes soon.

Then there's machine learning, another big field of computer science, which here is mostly used for classification tasks: statistical analysis to find out what belongs to which category. And of course there are neural nets and deep learning coming up now; you can see there's been some deep dreaming in this picture. But I haven't seen any journalistic piece that has used that yet. I worked a bit on something where I used neural nets to crack some CAPTCHAs for some databases, to scrape them better, but this is still in the making.

Here's one story that actually used natural language processing and machine learning: it identified police reports of the Los Angeles Police Department which misclassified over 25,000 crimes. When a police officer comes to a crime scene, he writes down a report; it later gets put into a database and classified by a clerk, who classifies the crime that happened as minor, serious, or another category. Based on the description, the Los Angeles Times wrote a machine learning classifier: it looked at the description of a crime and its proper classification, was trained on a training data set, and was then run over the whole data set to check whether all the other crimes were properly classified. Apparently over 25,000 of them were misclassified, and of course you can't go through all these records and classify them by hand; a machine can do that much more easily. It has also been confirmed by the police department that misclassification had been going on. The result is that the crime statistics show fewer serious and more minor crimes than actually occurred, and you can basically cover that
up through misclassification. The Los Angeles Times could uncover that through machine learning.

Then there's the big field of social network analysis, a favorite topic of Mr. Friedrich Lindenberg here in the second row. Social network analysis is basically the bread and butter of every journalist's work: we are collecting information about certain entities and trying to find their connections, and you can put that into a chart like this, a network graph. The problem is that the result is mostly not journalism; it's just a research database. You collected some facts and you can display them like that, but it's also very subjective data collection, because you only cover the connections you think are important, and you possibly don't see any others. It's more of a knowledge management tool, where you can collect everything you know to better collaborate with your fellow journalists. But as a result, it might not be journalism. So we can't say, "I got this big graph, now I compute an eigenvector centrality measure, and then I've found the bad guy." It doesn't work like that.
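To be clear, computing such a measure is the easy part; a few lines suffice. Here is a self-contained sketch with an invented mini-graph, using power iteration, which is one standard way to get eigenvector centrality. The point stands: the number it produces is a lead at best, not a finding.

```python
from collections import defaultdict

# An invented "research database" of who is connected to whom.
edges = [("Boss", "Aide"), ("Boss", "Lawyer"), ("Boss", "Banker"),
         ("Boss", "Clerk"), ("Lawyer", "Banker")]

adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

def eigenvector_centrality(adjacency, iterations=100):
    """Power iteration: repeatedly set each node's score to the sum of
    its neighbours' scores, then rescale by the maximum."""
    score = {n: 1.0 for n in adjacency}
    for _ in range(iterations):
        new = {n: sum(score[m] for m in adjacency[n]) for n in adjacency}
        top = max(new.values())
        score = {n: v / top for n, v in new.items()}
    return score

scores = eigenvector_centrality(adjacency)
print(max(scores, key=scores.get))  # the best-connected node, not thereby the bad guy
```

In practice you would use a library such as networkx rather than hand-rolling this; the talk's caveat is about interpretation, not implementation.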
So you can't compute the bad guy out of such a graph. What you need to do is proper journalism on top: you have a knowledge graph, you can look at it, and then you can interview people and find out more through proper, old-school investigative work. What you see here in the background is the LobbyRadar, which has now been shut down; it used to be run by ZDF. But this is more like a piece of art than something that gives you actual insight; it's difficult to make social networks understandable, let's say.

Then there's the brand-new field of algorithmic accountability. We also heard, for example, a talk about the VW Dieselgate scandal, and that is a topic of algorithmic accountability: more and more algorithms are put into every device we know, making decisions that affect all of our lives. Now we have some hackers doing some reverse engineering, and that is great, and they present at Congress, but of course this is basically journalistic work, and we need to bring these techniques into the newsrooms. Investigative journalists need to understand how this stuff works and how to reverse engineer it.

Nicholas Diakopoulos, a researcher in Washington, I think, did a lot of work on that. One example was the stock trading plans of executives, which are pre-planned; you can analyze how a plan works and whether insider trading is behind it. Or, for example, how does the iPhone autocorrect work? You can observe the output and you can observe the input: what is happening inside? Another example would be how prices are displayed on retail sites for different geographical areas. Analyzing that is not an easy task, but it's becoming more and more important, especially when there's not much transparency around how these things work.
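The basic move in all these examples is the same: vary the input, record the output, infer what happens inside. Here is a toy sketch of the retail-price case; the pricing function is an invented stand-in for a real site, which in a real investigation you would query over HTTP from different locations.

```python
def quoted_price(zip_code):
    """Invented stand-in for a retail site's pricing logic: this 'site'
    quietly charges 20% more in one region."""
    base = 49.99
    return round(base * 1.2, 2) if zip_code.startswith("90") else base

def probe(zip_codes):
    """Feed controlled inputs to the black box and record the outputs,
    the basic move of algorithmic accountability reporting."""
    return {z: quoted_price(z) for z in zip_codes}

observations = probe(["10115", "20095", "90001", "90210"])
for zip_code, price in sorted(observations.items()):
    print(zip_code, price)
```

The journalism then starts where the script ends: confirming the pattern holds at scale, ruling out innocent explanations, and confronting the company with the findings.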
So journalism becomes closer to science, let's say. Investigations in journalism use the investigative method, and you also have a hypothesis, like in science. You posit something, like "these kids are underprivileged because of corruption going on in the school system", and then you have to prove that hypothesis, so it's very similar to science. And science is also moving into a more reproducible and transparent manner now.

The story I told you earlier about the LAPD: this is the code that was used for that story. You have a machine learning classifier, a support vector machine, and you can basically run the code yourself to train the classifier and then classify some of these reports. They only published a tiny training data set and only parts of the data, but they basically made their methodology transparent. This is also where science is going: many research papers nowadays are not reproducible, but they should be. This is a Jupyter notebook: you can create a mix of prose and code, execute it, look at the result, and anyone else can reproduce your work. This one is Python, but R is also a favorite language of investigative journalists in the data area.

Then, one big thing I discovered was that software engineering in the newsroom is not that easy. First of all, of course, there's IT support, and there's the problem of the CMS. The content management system is always a problem; as a software engineer, you basically always fight the CMS. Big organizations like the New York Times basically create their own hacks just so that they can put their beautiful graphics into the rest of the CMS; there are big hacks going on there. But this is not what I want to talk about. Software engineering in the newsroom is basically also building tools for your fellow journalists, and that hasn't fully arrived yet. Software engineering
doesn't have its roots in the newsroom, and that's why it's a bit difficult at the moment. Right now, a journalist writes an article, it's published, and then you can forget about it; you never touch it again. So there's no technical debt in articles, and sometimes code in newsrooms is also written that way, for a single story: you write code for that story, you publish it, and then you forget about it. But of course, as software engineers we learned that this is not how to do things. We don't want to write the same code again for the next story; we want to have something reusable. We want to fix a bug only once, not a million times across all of our articles. That means we need to clean up a bit and develop some kind of method for writing software in the newsroom; currently it's quite a hack, as I perceive it.

And then there are computer science papers. I love to read them, they have very interesting ideas, but mostly they don't come with code, and when they do come with code, it's not running code; it's difficult to actually make it run. I once actually compiled a C library to make some machine learning a bit faster, and it's still not usable software.
So I can't give what I compiled there to a colleague to actually use. This is definitely something I hope for: that when you publish something in computer science, you give me something that I can use, so that I can bring it to my newsroom and make their lives easier.

There's also collaboration, which is something that is basically innate to the open source software scene and a bit more difficult in newsrooms. There's always competition going on, and investigative journalists especially used to be perceived as lone wolves: if you are onto a story and someone else has heard of it, you'd better publish soon, because the other guy might scoop you on it, and then your story is burned, you can't publish it anymore, and all the work you did for it was in vain. In open source software, on the other hand, it's great if many people collaborate on a piece of software, and the higher the bus factor is, the better. So we need to bring this idea of collaboration into the newsroom, and this is still a problem; it's not quite there yet. There are some collaborations now, between the New York Times and the Washington Post, for example, or between ProPublica and another bigger publication in the US, I think. As Correctiv, we also collaborate with many other news organizations, so that they publish our stories together with us. We hope that this idea of collaboration, which is basically a software idea as I perceive it, also comes into the publishing of news stories.

Another big problem is that when we have some software, we might as well use it: if there's no other software, I can only use what I have. The hammer-and-nail problem is definitely present in the newsroom. Have you ever seen a map in some news article with lots of points on it?
That's because the journalist who did the story had this mapping tool into which they could put a bunch of data, and it put the data on a map, even though it might not make any sense with regard to the story. They just use the tool that they have. Or timelines, for example: there's an easy tool to make a timeline, so you have a timeline, even though it might not be the best way to present your story. It's just the tool that is there, and developing another tool might not fit the deadline or your resources.

So I basically say: we need more applications for our society. Many advances in computer science are quite slow to benefit the public at large. If there is a big jump in, let's say, machine learning, Google knows it first, because they do the research and they develop the applications. Other big companies, like Palantir, the NSA, or ad companies, basically use the latest research to do better user tracking or better targeting, so they benefit quicker from these developments, because they do their own research or because they have more resources. Much cutting-edge research basically comes out of these corporations like Google. For example, TensorFlow from Google Brain recently got released.
It's a machine learning library; there are other machine learning libraries, but this is one that is very usable, and its advantage is that it's better supported, better documented, and easier to use. But it might not exactly fit the journalistic use case, and so journalism needs more resources to develop its own tools. The tools I mentioned, like DocumentCloud or Overview, are quite good; they are targeted at journalists and developed by journalists, and they fit the use case quite well, but it took six-figure amounts, as I recall, to develop them over the years. It was very difficult to get the use case right. Google Refine, for example, is an invaluable tool for many journalists to work with tabular data and clean it; it's really used a lot. But it was developed by Google and then open-sourced, and that basically means it hasn't seen a release in two years. It is kind of bad that we don't have the resources to work on the tools that we use in the journalistic trade every day.

So my call to you is: support journalism as a service to the public, and help journalists develop the tools. What we have here is basically a public good; in journalism we try to be in the service of the public. For example, join a newsroom if you can; it's really fun work. I joined a newsroom simply because I think it's basically the best political activism I can do, with the most impact, and not only focused on technical topics. We hear lots about data retention and other data topics, but when you work in a newsroom, you get a very broad range of topics from all over society, and you can still help with data literacy.

Another hint: if you want to get in touch with journalists, there's a thing called Hacks/Hackers, which is a meetup that exists in every big city in Germany.
Okay, I think it's only in Berlin and Hamburg, but I think there's a data journalism meetup in North Rhine-Westphalia as well. And if you're from any place else, like New York or London, they all have Hacks/Hackers meetups: the "hacks" are the journalists and we are the hackers, and they come together there to meet and talk about technology and journalism. So if you want to get an idea of what's going on in that world, join a Hacks/Hackers meetup and, I don't know, improve journalism by contributing your ideas. Thank you.

So I think we have time for questions. We have a question from the internet, please.

Yeah, hi, the internet is asking many things. Actually, the most important question is: is your data mining software available as free software? And please mention some of the names of the tools you have used.

My data mining software? I didn't write one single piece of data mining software; I basically write data mining software per story, and that is a problem. For every story you write a script that does it. That has advantages, because you can customize it; it has disadvantages, because you have to write the software, and it's not quick and easy. As a newsroom, we publish all our work on GitHub at github.com/correctiv, and you can have a look at the software that is there. Mostly it's just front-end stuff, but we will also publish more back-end data analysis pieces in the future. Many news organizations have repositories on GitHub that explain how they do their stuff, and you can find their software there. And the other question was about tools or something?
So, I mentioned TensorFlow as a machine learning library, and there are many, many tools for journalists, but mostly they are not tools so much as libraries. I'm using pandas for Python, but there are also a lot of R packages that you can use for data analysis. The problem is that nerds are a minority in the newsroom, and that means that if you want your journalist colleagues to use these techniques, you have to write tools to make them usable for, like, normal people.

Thank you, "normal people". Okay, thank you. Not that nerds are not normal people, but yeah. We have another question from you, please.

Thank you very much. You've been talking a lot about natural language processing tools and machine learning tools, and all of those are of course known to fail, to produce errors, to misclassify. And even if they classify correctly, it's not always easy to see what the classification actually means. You alluded to that shortly when talking about graphs, saying you don't just look for the central person in a graph and call that the bad guy. So how do you deal with these risks, with misclassification, and also with the illusion that the data could provide you some knowledge or insight that actually is not in the data, but is only apparently there?

Yeah, cross-checking. The normal cross-checking you do with data: check your data before you put it in there. Quartz recently published a long list of how you interview your data, to make sure it's up to a certain standard, or that you're at least aware of its failures; many times the input data is already flawed in many ways. Then, for your methodology, of course double-check it and talk to experts that know more about this field than you do. By publishing your methodology you basically make yourself vulnerable, but also transparent, so if there's something bad going on, your readers or any other
interested party can basically run what you did and then tell you what you did wrong. So none of these machine learning things replaces the journalistic process. As a journalist, you still have to validate your findings through second means, or at least do a check on a bigger sample. The result is not coming out of the computer; the result is coming out of the humans. The result of our research is not simply the raw output of a tool.

Thank you so much. I think we are now done with the minutes we had for questions. Thank you so much.