Hello everybody, my name is Benedikt and I'm a data journalist at the Süddeutsche Zeitung in Munich, Germany, and I'd like to give you a short presentation about how we are doing journalism with R.

Just a short introduction to the topic: data journalism is one of the latest developments in the media industry. It's about 15 years old, you could say, and while it has some roots in the 1950s, most of it is a totally new form of journalism. The idea behind it is to use statistical tools and methods to investigate and explain developments in our more and more data-driven world. I'd like to give you a short overview of how we usually work, our typical workflow for a data-driven journalistic project.

Why does that matter? Well, there is a famous report by the New York Times from a couple of years ago, where one of the first things they stated was: "Not enough of our report uses digital storytelling tools that allow for richer and more engaging journalism. Too much of our daily report remains dominated by long strings of text." That is something that's also a big concern for our newspaper. The New York Times calculated some numbers and found out that just around 12 percent of their stories had some deliberately placed visual elements. That's not much; usually there are a couple of hundred stories per day. So that's also something we're trying to get better at: placing more visuals to explain things that might be explained better visually for our readers.

Who are we? Well, the Süddeutsche Zeitung is one of the oldest German newspapers. It was founded right after the war; it was one of the first newspapers in Germany. Today it reaches around 1.3 million readers in print and a couple million more on our digital channels. May was obviously a pretty strong month, thanks to COVID-19.
We had nearly a hundred million visits across all our digital products.

I am part of the data and digital investigation team. We are five journalists, and we combine working with data with expertise in digital research, so we use digital tools and technologies to investigate stories. Most of our team members have a background in the social sciences; we have one who studied math and physics and one who comes more from a developer perspective.

What do we do? Well, we have a couple of things we're doing. We're analyzing and visualizing large data sets, obviously. We're helping other departments with data-driven reporting, so they can ask us questions when they have some statistics, or when they have questions about visualization. And we're also doing a lot of short- and long-term research and investigation with data and digital tools.

Shorter projects, which usually last between a couple of days and one or two weeks, could be analyzing election results; that's a typical thing data journalists do: wrangle through the numbers and try to find stories in them. We also try to analyze the impact of ideas that are currently being discussed in the political debate, and if we're lucky, we can find a number that adds something to that debate. Long-term projects can range from a couple of weeks to a couple of months, and if you're looking at the Panama Papers or the Paradise Papers, even a couple of years. So it's a very broad range.

One project that was not a large-scale investigation is something a colleague of mine did some months ago. She analyzed the speeches in the German parliament and built an NLP model, so she didn't have to compare word by word but could compare meanings over time in the German parliament.

How do we do all of this?
We're trying to break the workflow down a bit. Let me start with this: most of you might know Hadley Wickham's visualization of the data science process, and the data science process in journalism is quite similar, with some slight changes. A lot of it is pretty similar, but we're taking out the model part. Data journalists focus less on model building. There are exceptions, and I have the feeling it's getting better; more and more models are being built for journalistic purposes. But we have experienced a lot of problems there when trying to communicate uncertainty. Models usually come with some kind of uncertainty, and it's very, very hard to explain uncertainty to readers, and to explain what a model does. So, as I've mentioned, I have the feeling it's getting better and more models are being built to explain things and make journalism better, but right now model building is not one of the large domains in data journalism.

The rest of it is pretty similar to the data science workflow, I guess. It's about finding suitable data; it's about data wrangling, where of course a lot of the time goes; and data analysis and data visualization are very important for data journalism, as visualizing and communicating the data is the biggest part of our job.

How do we find suitable data? Well, the big difference, I guess, between data science and data journalism is that we are using a lot of publicly available data, open data from government institutions, for example. We're also using a lot of data that's scrapable from the web, and sometimes, when you're working in the investigative department, you're using data from leaks, for example the Panama Papers.

Let's dig a little deeper into public data. In my experience it is very easily available in the US and the UK, which have a strong tradition of transparent government.
In Germany it's not that easy. There is a freedom of information law on the national level, so for national institutions you can ask for certain data sets. But in Germany we are very strong on data privacy, which is good, but it has a downside if you're the one who wants to get data: it's always a convenient excuse for refusing to open up data. And not even every state has a freedom of information law; Bavaria, for example, doesn't.

Some examples of public data sources: the biggest public data source in Germany, I'd say, is Destatis, the federal agency for statistics, which also collects data from the individual German states. They have a huge repository of data, and they are very nice and always try to support your data requests, so they are a very useful data source for us.

I'll name another one: the federal agency for disease control and prevention, the Robert Koch Institute, which is currently in the news because they publish the COVID-19 data. They learned a lot during the last months. They started out not that well, with a table on a website, and right now they have, well, you could say an open data website where you can use an API to access the data in JSON format. You can also download the CSVs, so they got a lot better recently.

I'm always trying to name some packages that we're using. For Destatis, the federal agency of statistics, you could use the destatiscleanr package that a colleague of mine wrote a couple of years ago. But luckily, really recently, they added flat-file CSVs for download, so you probably won't need that package anymore.
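To make the reading step concrete, here is a hedged sketch of loading such an export with readr. The file name, metadata lines, column names, and values below are all invented for illustration; the point is only the pattern of skipping a few non-data lines above the real header in a semicolon-separated file.

```r
# Hedged sketch: this is NOT a real Destatis file, just an invented
# semicolon-separated export with two metadata lines above the header.
library(readr)

raw <- c(
  "GENESIS-Tabelle: 12345-0001",   # invented metadata line
  "Stand: 31.12.2019",             # invented metadata line
  "jahr;bundesland;wert",
  "2019;Bayern;100",
  "2019;Berlin;50"
)
tmp <- tempfile(fileext = ".csv")
writeLines(raw, tmp)

# read_csv2() expects ';' separators; skip = 2 drops the metadata lines
d <- read_csv2(tmp, skip = 2)
```

With a clean flat file there is nothing left to skip, and `read_csv2(file)` alone does the job.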
You can just download the flat-file CSV from them. They used to have CSVs with multi-line headers and all that stuff, stuff you really don't want to care about when you're just reading in your data.

Scraping data is something we're doing a lot. It can start by just extracting tables from web pages. Accessing JSONs or flat files on the web is basically not scraping, but, well, it's making data that's on the web accessible. And automated data collection, of course, is another important part of that. We're mainly using readr if the data is already in a flat-file format, rvest is the library we often use for web scraping, and jsonlite is my personal favorite for accessing JSONs.

Leak data is something a bit separate, because it's usually not transmitted via the internet. Usually you get a USB stick, a hard drive, or a DVD, and if it's really sensitive data, it's usually not sitting on a machine that's connected: it's on a machine that is not online and will never see the internet. We have a database system in use, it's open source, it's called Aleph, where we're trying to make all the data accessible regardless of the document type. So you can have emails, you can have PDFs, whatever; it's OCR'd and you get a full-text search in there.

Data wrangling is something that takes up most of the time of our work.
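Before moving on, a minimal sketch of the two web-access patterns mentioned above, scraping a table with rvest and reading JSON with jsonlite. The HTML snippet and the JSON payload are invented stand-ins for a real page and a real open-data API.

```r
# Hedged sketch: the HTML table and the JSON records are invented
# stand-ins for a real web page and a real open-data API response.
library(rvest)
library(jsonlite)

html <- '<html><body><table>
  <tr><th>land</th><th>cases</th></tr>
  <tr><td>Bayern</td><td>100</td></tr>
  <tr><td>Berlin</td><td>50</td></tr>
</table></body></html>'

# pick the first <table> on the page and turn it into a data frame
tab <- html_table(html_element(read_html(html), "table"))

# parse a JSON array of records straight into a data frame
api <- fromJSON('[{"land":"Bayern","cases":100},{"land":"Berlin","cases":50}]')
```

In real use you would pass a URL to `read_html()` or `fromJSON()` instead of an inline string.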
I'll give you a current example: COVID-19. We are currently producing around ten automated charts that are updated every 30 minutes from various sources. We're putting all of that together in a single data frame, with separate layers, or separate factors, so we can just filter out the data for each individual plot. It's much easier to have one data frame that we put into our archive and that's being appended all the time. We're uploading the data to Datawrapper; Datawrapper is our visualization tool, with the big advantage that it's very easy to get mobile-ready charts from there, charts for desktop screens, and even charts for print.

This is the COVID-19 project folder. It puts out a lot of stuff, but basically what we have is a big functions folder with input functions, processing functions, and production functions that all build on each other. Input gets everything in; then we have that data frame; then we process the data frame, calculating for example doubling rates; and then in production each chart has its own R script that pulls the data out of the data frame and pushes it up to Datawrapper. And we're just running that on a server.

The typical data-wrangling libraries: well, I guess a lot of you use the tidyverse, and we do as well; dplyr, stringr, and tidyr are the ones we use most.

A big part of our job is to find a story in the data, so data analysis plays an important role. There are some typical story angles in data journalism that I just wanted to mention here. Winners and losers, or something like that: we're always looking for the highest or lowest, top or bottom, which is just the equivalent. Outliers are very interesting, and something that is a real outlier is usually a story for us. And change over time is something you see a lot in data journalism, which is pretty obvious: if you have a couple of points in time, you just plot them, and if you see something, it's already a
story. We're using dplyr a lot, and if we have something special in mind, we use specialized packages. I brought up quanteda as an example: there was a project I did a couple of years ago where we were trying to compare the positions of parties to their election programs, and we used the Wordfish algorithm for that, which is implemented in the quanteda package. So it always depends on what you're going to do.

Let me put some more emphasis on the story part. This is a project a colleague of mine did about children's books. She collected thousands of children's books, with keywords that describe what happens in each book; the data came from a library in Germany. She did the analysis and the visualization, so that's a good example. What you see here is, well, you could say basically a word cloud; it's a graph that shows some of the keywords associated with male or boys' adventures in children's books. You can see, for example, pirate, fight, Africa and Asia, historical story. So that maps the type of male adventures, or boys' adventures, that you can find in the children's books in our data.

It gets interesting when you compare that to the female adventures, as you can see here. There are not that many dots in this graph, which shows that female adventures are limited to fewer keywords; that's the first learning. And of course the content is different as well: it's more about love, it's about vacation.
It's about horseback riding or princesses. So it's not just the content that changes, but also the diversity of the stories. That's one interesting learning I got from this analysis by my colleague.

For data visualization we usually use Datawrapper for, let's say, simpler digital graphics, because it's done very quickly and it's already mobile-ready, and you get a slight interactivity that you can use or leave out. So it takes away a lot of the pain that you usually have when creating graphics for digital products. For fancier graphics we mostly use ggplot, which is also very nice. We put those in the newspaper, because the print designers can just take the SVGs or PDFs and work with them.

The problems we encounter: well, maybe some of you have a pretty similar feeling when you're doing data science. Some of our colleagues have no idea how tedious working with data can be and how long it might take. The problem is that for some of them, what we're doing might look like magic: we're just typing something into our keyboards and laptops, and we get a resulting graphic or visualization. So it's always hard, and a lot of explanation is always needed about what we do and how we did it. Well, I've got the feeling it's been getting better over the last years, but maybe you've experienced something similar.

There is one thing I'd like to mention which is sometimes discussed in journalism, and tended to be discussed a lot more a couple of years ago: is data journalism really journalism, or just hacking away on your laptop? I'll argue: yes, it is. But it's something different. It's another perspective on information gathering, because you've got another type of source; data is now your source, and you can ask it questions. And it's another perspective on communicating research results to our readers or users, because it's not so much about text.
Well, there might be the option of writing a text without any visualization at all, but mostly the work we do ends up with some form of visualization. So it's also another perspective on showing the results of your research to the users.

You can find a lot of our stories on our project site, projekte.sueddeutsche.de, and some of our code on GitHub. We're trying to publish more of our research code, mostly R, sometimes Python, but, you know, it takes a lot of time cleaning up this stuff, so that's always an issue. But we are really trying to do better.

If you've got any questions left, feel free to contact me by email, or you can send me a DM on Twitter. And if you want to find out more about the topic of data journalism, just follow the hashtag #ddj, for data-driven journalism, on Twitter; it's very helpful and you really get a sense of what's going on in this community. Thank you very much, and enjoy the rest of useR!