The point of this series is to show how statistical computing, and R in particular, has spread or escaped out of statistics and is now having an effect in the rest of the world. Harkanwal Singh is probably New Zealand's first full-time data journalist; he's data editor at the New Zealand Herald.
As Thomas said, I'm going to talk about R and data journalism in New Zealand, and I wanted to give a bit of context. So I was briefly going to mention the brief history of data journalism worldwide, where there's 538, The Upshot, and the Guardian had a data blog which led the way initially, and there's ProPublica, and we've got Herald Insights. I don't think we're in the same category, hence the emoji there. But we're probably the one section of New Zealand media which actually does data journalism fairly regularly, and because the media market is quite small we're leading the way. That's the advantage of being in a small market: you only have to do this much to be the frontrunner. But how many people are aware of 538?
Just a show of hands. So a few people are not. 538 is Nate Silver's website. Nate Silver famously predicted the 2008 and 2012 elections, working for the New York Times, and then he went and started his own website, which is owned by ESPN. I'd really recommend you check it out. It's probably the world's only dedicated data journalism website. They apparently got it wrong in the election last year, which everyone likes to point out, but Nate actually had something like a 30% probability of Trump winning, which was better than the one percent probability which Princeton had.
Anyway, I'm going to do a brief introduction of what data journalism is, what we do day to day in our workflow, where R fits in and what we try to produce. So what exactly is data journalism, and how does it compare with traditional storytelling in the newsroom? One difference is that in all the stories and all the work we produce, data is integral to the story being told. A lot of times it has happened in the newsroom that someone will come with an idea, most probably because they have seen the New York Times or the Guardian do something, or someone has done a story in Australia, and they're like, can you do this, without actually knowing whether the data exists or not. And that is the fundamental idea: a data set has to exist for you to be able to talk about the topic.
538 defined it as empirical social science on deadline. I put it in quote marks because I don't really think we do that, but 538 definitely tries to do that, and they're a bit more ambitious than we are. So the idea is that rather than telling stories where you find sources, where people are saying something or you're publishing from reports, you're doing your own analysis. You're taking data sets and analysing the data, and either you're doing modelling, which is what 538 predominantly does, or you're doing visualisations to tell stories.
My favourite quote about data journalism is actually this: you have to be like the worst tabloid newspaper in the front and the Academy of Science in the back. And I think this is actually how it works. Rosling wasn't talking about data journalists; he was talking about his own work, where he popularised data visualisation. I think that's quite interesting, because you have to be passionate and you have to really sell the story, and that is fundamental to data journalism as well.
So why data journalism, and why not just call it journalism? It's a buzzword, and buzzwords are actually really helpful sometimes, because people can hire you without actually knowing why they're hiring you. So if you ever get a buzzword, use it while you can. Part of the popularity of the buzzword was Nate Silver predicting the elections, and also the New York Times doing really interesting stuff with their visualisations back in 2012, 2013, when they had Mike Bostock working for them, who created D3, which is a JavaScript library to create these excellent visualisations. That's kind of why it became a buzzword.
But essentially, as I mentioned before, the method is fundamentally different. You are interviewing a data set rather than interviewing people at the forefront. You can't go and talk to people about the data set, and that's actually really helpful. So if someone sends me a data set, say a real estate company sends me a data set, I generally ignore it. The reason is they've got a motive in supplying that data set, and most of the time I'm going to public data sources to find data sets which are actually better than what is being supplied with press releases and so on. The end product in our case is visualisations. We're trying to produce interactive visualisations to tell these stories.
And the reason for that is that it's actually easier to have the New Zealand media understand the importance and power of a visualisation than of a statistical model. That's the challenge Nate Silver has: he does predictions, but people just want to know whether he's right or wrong, whereas visualisations reach wider audiences.
The one thing that I considered fundamental to doing data journalism was writing code as a journalist, and part of the reason was that there was no one to help me. No one could produce visualisations for me, no one could analyse data for me. I think I've taken that motivation away at the Herald: because I'm there, there is no incentive for others to learn to code. But I want to emphasise this. I trained as a journalist, I wasn't a programmer, and writing code as a journalist fundamentally changes how you approach stories and what you think is possible when you get a data set. Your tools in many ways define how you think and how you approach a story. This is true for someone who does multimedia and video journalism and for someone who writes long-form stories: you try to get good at that and you try to utilise that space. Writing code, you look at possibilities in a very different way.
Why do news organisations do this? As Thomas said, people seem to like it, which means clicks or awards. But I think the simple thing is that it actually works. And every two or three months you have to come up with something that really works. It can be a data set which I wouldn't visualise otherwise. When the Rugby World Cup was happening I visualised all the All Blacks tests going back to 1905, basically showing in one single visualisation the dominance of the All Blacks. It was actually a really fun visualisation to do.
But again, the point was that every so often you need something that makes the newsroom, the media organisation and the media community realise that this actually works. So part of my talk is that it works if it's done well and correctly. That's not as straightforward as deciding to do a visualisation and a visualisation just appears. It's actually really hard. This is what I wanted to talk about in terms of the process, where R fits into my workflow and what I do at the Herald. But I want to talk in the context of what we're trying to produce.
How many people are aware of Herald Insights? A few people are not, so I'm going to show some visualisations. This is what we try to produce. This was the most popular visualisation last year. It's simply a meshblock map of burglaries. There's no analysis whatsoever. It just tells you how many burglaries happened in your neighbourhood or street. But it was the first time that the New Zealand Police had released this data set, and what we did was simply publish a map where you could jump in and see the burglaries in your neighbourhood.
What was really interesting about this map was that it got over 200,000 users in the space of two days, which is quite high for a news article. From my perspective, more than the analysis part of it, it was simply that people could access this data. There are over 35,000 meshblocks. You could publish this data in raw format, and people who are expert data users would want to play and look around with what they can do. But for a majority of people there is no way to access this data. So what this visualisation did was simply make that accessible. And it apparently caused a lot of problems within the police, because everyone was trying to prove that their stats were not as bad as they looked on the map.
So apparently there were emails sent around like, you know, we are not the worst, the Herald is completely wrong. Putting the data out there has real impact. This is something I want to emphasise: for people who have data analysis skills or data visualisation skills, you can have real, meaningful impact if you can communicate the data to a wider audience.
One of my favourite projects last year was this explainer of what's behind the migration numbers. Every month Stats NZ puts out a press release where they say record migration, and what really fascinated me was that no one had gone and looked through the data and what it showed. It's a really simple visualisation, but it's probably my favourite one from last year. That chart basically shows arrivals and departures. The next chart takes out New Zealand and Australian citizens, so you have the context of what migration actually looks like. This chart goes back to 1979. Then the next chart showed arrivals proportions by region. This is Europe and this is Asia, and it had grown from about 8% to 37%.
Then I wanted to emphasise something. This is what people considered was happening: that a lot of people are arriving from India and China. But the context that was missing came from taking another data set, which wasn't released as part of this series, which was the visa data. You would see that the blue line is the number of student visas, and you'd see that there weren't any work, resident or visitor visas. So the pattern was really different once you looked at the visa types. The best reaction I got was on this story, which was like, huh, and then people were like, I didn't realise this.
The most interesting comment I got was from a person at Stats NZ who I was interviewing for the story, who used to work a lot with this data. When I showed her the visualisation she was like, oh, I actually never looked at the data this way. So that's just the power of visualising a data set in a simple way. I also wanted to emphasise the different patterns for different countries. China looked really different, the Philippines looked really different. At the end I basically published an exploratory visualisation where you could select a country and look at the data. What was really interesting when I asked people was that most of the people who were migrants just went to their own country. They skipped the rest of the story; they didn't really care what was happening.
This is a really cool project that my former workmate Caleb did with Matt Nippert, which was about banks' KiwiSaver funds investing in cluster bombs, land mines and tobacco. What was really interesting was that RNZ published the story at the same time. But theirs came from someone informing them, whereas for us it started with Caleb analysing a lot of data sets and finding that this is where the investments were going. What you could basically do was see how much was invested where for different kinds of funds, and then you could explore it individually by bank. So what was really interesting from a news organisation perspective was that RNZ had broken the story, but we had this, which they didn't have: we could give people individual data for every bank and every fund. So that's a brief example of what we are trying to deliver. Sometimes you fail, sometimes you succeed.
One of the very first interactives I did at the Herald was a polling place map, where I overlaid the previous election boundaries with the new ones and then put all the polling places there, trying to show what effect the changes would have. It was a really complicated map, in that you had to select layers and everything. And someone said on Twitter, this is the worst visualisation I've ever seen. You know, you've slaved away for a week and you've cleaned up all this data from the Electoral Commission. After a pause I said to the person, did you zoom in? And they basically went, oh, sorry. That's the challenge of learning how to communicate visualisations.
So, since this is the birthplace of R: R is pretty much central to my workflow as a data journalist, and I wanted to show how we use R. It might not be as exciting for people who work with R every day, but it might be exciting for people who don't, and maybe I can persuade you to try it out. I basically started learning R because Amanda Cox, who's the editor of The Upshot at the New York Times, used it. A lot of data journalists who've been doing interactive stuff have been using R for over a decade now. When I persuaded my previous bosses to give me a job as a data journalist, I presumed that this is what you need to know, so I just started learning it on my own. And because I came as a migrant, I thought everyone in New Zealand newsrooms would know what R is, because it was created here. So I went to my first newsroom and I was talking about how I'm learning R and stuff, and everyone was like, what? Then when I interviewed for the Herald I listed R quite prominently on my resume, saying that I know R and JavaScript, and in the interview they said, oh, you know HTML and CSS, wow.
So part of my context was that newsrooms in New Zealand were largely unaware that this language, which was created in New Zealand, was being used by journalists worldwide to do some great data journalism. And it has expanded. There's a strangely named conference, which I'd recommend checking out if people are interested in this, called the National Investigative and Computer Assisted Reporting Conference. It's based in the States. It's called NICAR, which is much better to search for; just search NICAR. Over a decade they grew from a conference of about 100 people to about 1,200 people this year, and these are people who write code. A lot of the stuff you see in the New York Times and elsewhere is people coming together and teaching each other how to write code. One of the interesting things I looked at: R had about two sessions there five years ago. This year it had six sessions at that conference. So it's used almost universally in any data journalism newsroom.
I primarily use R for cleaning up data and for doing exploratory analysis. Basically I use whatever Hadley teaches, and every time a new package comes out I'll try it as quickly as possible. And if anything changes then I'm kind of lost. I started learning as Hadley was creating dplyr and all these packages. Where it really works for me is getting a data set, analysing it really quickly, knowing what I need to do with it, doing some exploratory analysis and creating plots in ggplot, so that I know what story I'm going to tell, and sharing analysis and code with other people. When I say sharing analysis, I'm just creating thousands of plots, putting them in a Dropbox folder and giving that to another journalist so they can look at the data themselves. So I'm going to show you an example.
So there was a story that I did last year, and it was probably the most complicated and most disappointing story I did, because it was about NCEA standards. I imagine most people know NCEA. NCEA is the New Zealand education system for how you go through high school, where you have to do different standards and different subjects. It's really complicated. I'm not really sure why, but it's really, really hard to understand what's going on. Every year the Minister says that NCEA pass rates are rising, and they put out this press release, and it's kind of completely bullshit what they're saying. One of the first visualisations that I did at the Herald, two or three years ago, was on this topic, and I've been endlessly fascinated with this data set.
So how many people know about deciles? That's how the New Zealand school system is funded. Sorry, I don't have this handy, but I did this visualisation over three years ago and it was essentially comparing internals and externals. Internals are standards which are assessed internally in the school, and externals are when you're sitting an exam. This is actually not that good a visualisation, to be honest, but it was the first one I'd done properly. I'd do it very differently now. The whole idea was that if you looked at something like maths with calculus at level three, at decile 10, there's a gap between internals and externals. That's internals and those are externals. There's a gap in achievement; the axis is achievement. If you went to the poorest schools, the gap opened up to 50%. This was one of the best projects I did, because it got picked up by researchers, and a Fulbright scholar included it in a report.
The most interesting thing was that I did this story with Nicholas Jones, and when he went to the ministry and said, what's going on, they were like, nothing, what do you mean? So when you find something like this and you try to explore it, it's actually not fun sometimes.
I'll come back to this visualisation. I wanted to show you the process of building it. This is probably my most well-documented code. I used this example because I actually know what I'm talking about. I used an R Notebook, which I'm quite fond of, and it essentially starts with: this is the data we got. They sent us the standards data, which is each individual standard, and each subject is made up of 14 different standards. We got it by every ethnicity and every standard. They sent us the data but they didn't tell us what the standards were. We had to wait another month to get a spreadsheet which would tell us what the actual standards were. So the first thing I did was to clean up this data set, make sense of it, and get to a stage where I could tell a story of what is actually going on.
For people who are real experts and work with data, you have to realise I'm doing this in a newsroom, and four years ago this was not even possible to do in a newsroom. This is all using things that were talked about in the previous talk. As I said, I'm a huge fan of dplyr and the whole tidyverse, and of using pipes for everything, including ggplot if I can. One of the first things I realised was that the standards went through different versions. So not only do they have 14 different standards, one standard will have something like 7 different versions. Imagine how complicated that gets, and that's the education system.
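One simple way to cope with multiple versions is to keep only the most recent version of each standard. The following is a minimal sketch of that idea, written in Python rather than the R and dplyr used in the talk. The field names (`standard`, `version`, `entries`) and the rows are invented for illustration and do not come from the ministry's data.

```python
# Hypothetical rows: one record per (standard, version) pair.
# None of these identifiers or numbers are from the real data set.
rows = [
    {"standard": "S1", "version": 1, "entries": 120},
    {"standard": "S1", "version": 2, "entries": 135},
    {"standard": "S2", "version": 1, "entries": 80},
    {"standard": "S2", "version": 3, "entries": 95},
    {"standard": "S2", "version": 2, "entries": 90},
]


def latest_versions(records):
    """Return one record per standard: the one with the highest version."""
    latest = {}
    for rec in records:
        current = latest.get(rec["standard"])
        # Keep this record if it is the first seen for the standard,
        # or if its version is newer than the one kept so far.
        if current is None or rec["version"] > current["version"]:
            latest[rec["standard"]] = rec
    return list(latest.values())


for rec in latest_versions(rows):
    print(rec["standard"], rec["version"], rec["entries"])
```

Filtering this way trades completeness for a data set small enough to explore quickly; older versions of a standard are simply dropped rather than reconciled.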
So I did something really simple, where I took the latest version of each standard, filtered the data set and just went for what I could do in a month, because you could probably spend a year analysing this data set. What I really needed was a story to tell which would be understandable to Herald readers, not so much an academic exercise in finding out something.
So what I did was achievement versus unit standards by decile and level. Achievement standards are standards where you get excellence, merit or no merit, and unit standards are where you just pass or fail. I broke this down by ethnicity, level and decile, and what I aimed to do was look at these patterns over different deciles. You can clearly see that in level 2 there is a huge jump in unit standards, so achievement standards as a proportion drop dramatically from level 1. And here's an interesting fact: the stats that the minister quotes for pass rates are NCEA level 2. When I first did the story, the news editor I was working for really wanted to say the schools cheat, and this is the other side of data journalism, where you have to tell them that is not actually what the data says. A lot of schools are trying to get kids across the line in NCEA level 2, and that's what it was showing over there.
This was all part of the exploratory analysis. Then I did this plot where I split it by different ethnicities, and you can see the difference as you go from Asian to European to Maori and Pasifika. It became quite apparent that Maori kids were doing way more unit standards. We also had this data by every subject. All my analysis was contained in this one R Notebook, because it just kept getting so confusing and I wanted to be able to come back to this data at some stage as well. And this is where the tidyverse thing really works for me, which is that I can
actually read my code quickly and know what I'm doing, and it is contextual to my analysis as well. Then I started looking at individual standards for subjects. In the end this was a lot of data. I wanted to look at what proportion of entries were happening into the sciences at different NCEA levels, and what it showed quite dramatically at NCEA level 3 was Asian, then European, and Maori and Pasifika were just at the bottom. That's a story most people would know, but this is the actual data which shows it to you.
So I created these plots in ggplot, and then I started working to tell the story in the interactive. The interactive follows the same thing. All my analysis and my storytelling has happened in R; I know what the story is that I'm going to tell, and I've created the plots. If only the web were just about publishing those plots. But then I create these plots in JavaScript and try to tell a contextual story. This is the other side of data journalism: data visualisation is a whole other thing, and telling stories which are this complicated contextually can be really hard.
What I did was focus on level 2, because that's the stat that the minister quotes most often, and you could switch to level 1 and level 3. I really like transitioning charts, even though they might not be useful. It did this achievement versus unit standards comparison, and then it showed you a chart for Maori students, so there's an instant comparison between what the overall pattern looks like and what it looks like for Maori students, and then you could select a different ethnicity. Then it went into comparing those standards for sciences, where you could see that everyone starts together, they go to level 2 and there's a slight difference, they go to level 3 and it widens. And then I had subject data as well, where I still wanted to do the external and internal comparison
for Maori and Pasifika students, and then I let people explore by different subjects. Where I think the project didn't work was that I was really ambitious. I captured everything about the education system, and then while I was trying to finish the project I had to go back home, and it just didn't really work. I didn't think it worked.
But what was really interesting for me was that this project would not have happened three years ago. When I started out in the newsroom there was no appetite for doing something that complex. People really wanted other people to say what the data said. When I did the internals and externals project the first time, the news editor just wanted the ministry to say something. I kept telling him, well, the data says this, but he was like, no, what does the ministry say? And I was like, it shouldn't really matter, because the data says this. That has been a real challenge.
The reproducibility of this is really important to me, so I am going to try more often to share the code, so you can see my analysis. It's a bit scary, because if you're not a good programmer you're like, I don't really want other people to see my code.
The one thing that I wanted to mention, in the data journalism context, is using Excel. A lot of stuff is done in Excel, and a lot of the time Stats NZ spreadsheets would supply you with prebuilt charts as well. Where R really comes in, from my perspective, is that your tools shape how you think and what you can do. In Excel you can do analysis, but you can't go back to it, it's not reproducible, you can't script it, and when the next data set comes you can't just plug in the data and execute it again. So if you're a journalist, or a programmer who wants to analyse data, I'd really recommend trying out R and doing the data analysis process there. So I prototyped these visualisations there, and that was the final visualisation that I published, and this
is my process for most of the visualisations, except for when there's an earthquake and the editor rushes to you and says, can you produce something amazing for me? Then this is not the process; then you just do a map as quickly as you can. But for really interesting stories, R's exploratory tools are immensely useful.
I wanted to share this one repository. It's actually not a great data set, but I had a lot of fun doing this. I published this piece earlier this year about the rise in distracted driving, where those are mobile phone offences. Anyone who drives and has a phone with them knows that the data set is under-counting. I've been looking at this data set for ages, and the New Zealand Police publish this horrible spreadsheet where everything is in a different sheet, and the dates are on the top, and then there is an empty column, and then the next year. I have published a repository, and you can just go and look this up if you want. I was quite pleased with it, because I wrote this one function which you would run over a sheet, and it would grab all the column names and assign all these dates. So the next time the New Zealand Police publish, all I have to do is add a month or a quarter to it, and I can regenerate the charts quickly by running that analysis. So one thing that I'm going to try to do is publish more of these repositories online.
The other thing that I wanted to mention was ggplot. There's this whole argument in the R community online between using base plots and ggplot, and sometimes it gets really heated as well. It's actually quite funny to watch, because Nathan Yau, who's one of the best data vis people I know, uses base plots for generating everything. But I have got a link here to this presentation from the Financial Times, which is about ggplot2 as a creative engine, and they create charts in ggplot, get
them into Illustrator, clean them up, and they go directly to print. So I'd really recommend having a look at this.
My proudest moment with ggplot was when I generated a chart and gave it to a journalist and it ended up on the front page of the Herald. I didn't put the background there, but it was quite a lot of fun. The context to this: we had this burglary data, so I did do some data analysis with it. The newsroom wisdom was that they wanted to focus on rich areas, and I just combined it with the deprivation index at meshblock level, and it showed that it was mostly poor people being burgled. One of the news editors was really disappointed when I told them that. It did end up on the front page, and I was really proud of it. So ggplot is really helpful. In one instance, with the data set at area unit level, I generated 11,000 plots, because I wanted to give it to a journalist. Rather than give them an interactive, I was like, here's a Dropbox folder with 11,000 plots and you can explore them for yourself.
This is what I really wanted to end the talk with, because I'm giving plots to a journalist and I've done all this work. The challenge that future data journalists are going to face is that most newsrooms look at people who can write code or do graphics as tool people and not as journalists, and you really have to assert the fact that you're a journalist in your own right. A journalist told me last year, I'll do the words, you do the data. And I've been told by programmers that I'm not a programmer. So there's a real crisis of, what am I doing here? And whose analysis is it? You can share the analysis with a journalist, but they can put their spin on it, and you might disagree with their analysis. The really tricky part is what the audience gets out of it. Sometimes I get angry emails from random places, and they're actually really funny, but sometimes you have to just go and
be like, this is not actually what the data says. So visualisation has this specific challenge of communication, and it's something that you're constantly learning. A visualisation can be nominated for an award, which is kind of meaningless, because the audience might not actually get it; you don't know if the visualisation is working for them.
The most interesting thing to me in a newsroom setting is who gets to reject ideas. If someone comes up with a stupid idea, who's actually rejecting it? If you're a young journalist, just out of journalism school, some editor will come up to you and be like, do this, and you're not in a position to refuse. That is something I wanted to talk about, which is the hazards of statistical thinking in a newsroom.
This happened when I was working for Stuff in Wellington, which was combined with the Dominion Post newsroom. At the time John Key famously said Wellington is dying. They wanted to do a whole series, and the editorial line was, we're going to do a whole series which tells you that it's dying because John Key fired all the civil servants. They said, can you produce an interactive for us? I went and obviously looked at the data, because that's what you would do, and then I went to the senior editor and I was like, how do you know it's true? And they said, we wrote a story about it. And I was like, that doesn't really make it true, does it?
That was one of the hardest things for me to learn: you have to reject an idea where you're like, this is completely untrue, what are you talking about? This will not make you popular in a newsroom. If you go to a journalist and say, you really shouldn't be doing this with this data set, they will probably stop talking to you for a while, and they will actively avoid you on a story. So I think this is the real challenge, and I really wanted to share this quote, which is: if your level of numeracy is so
abysmal, you aren't qualified to be a professional journalist. I know it may hurt to read this, but it's the truth. Nobody who lacks a working understanding of math, statistics and scientific reasoning can properly inform the public.

I think people here would agree with this. I'm not sure that most people in the journalism industry would agree with this. Journalism schools wouldn't agree with this, and the level of training is not up to par. This quote was actually by Alberto Cairo, and I've linked to the source; you can read his really nice rant about this. I really agree with it: as more data becomes available, as more organisations publish more data, you actually need to know this.

So the harder problem, from my perspective, harder than learning how to code or run models, is the ability to think and argue in a newsroom. It really helps if you're a bit stubborn; it goes a long way. For me, what was interesting was that I came as a student and I was essentially a migrant in the newsroom, and there aren't that many migrants in the newsroom. I just decided I'm going to look at data sets to do stories, because that felt like informing myself better, and what I realised was that most people were not doing that. This is a harder problem to solve, because a newsroom really wants to publish stories which will get clicks, which matter in their opinion of what is a good story. And there is a sort of vacuum within which journalism operates, where if a statistician says this is not a good story, the statistician is looked at as an outsider.

One of the proudest moments in my career was when I published a drought map while I was working at Stuff, where you could click on a weather station and see how bad the drought was. Thomas actually said on Stats Chat that it was one of the best interactive graphics created in the New Zealand media. This was when I was just starting, and I thought, Stats Chat actually praised something. And that's the
kind of vacuum you work in, where the awards are given internally in the industry. Your feedback loop is internal to the industry, and you don't have an outside feedback loop. So what I did actively was develop an outside feedback loop.

And this is the story that I wanted to end with. How many people know about the Chinese-sounding names buyer story? Okay, some do. So I've posted this snippet from the main story. If you read it, it says leaked figures support claims that Chinese investors are a big influence on Auckland's overheated property market. Labour did this analysis where they had leaked data from a real estate agency for one month. They took the names of the buyers, did a Bayesian analysis on them, said these are Chinese-sounding names, and extrapolated that most buyers in Auckland are Chinese-named buyers.

I was involved when the story was happening. I corrected the data and gave it back to them, and I told them some of the data is missing, because I was trying really hard to make sure this story wouldn't be published. I contacted Thomas, and I contacted Edward Abraham at Dragonfly, and this was a really interesting experience for me.

This is the note I wanted to end on. When you look at that snippet, you're like, wow, that's amazing. That same story had these bullet points at the bottom, and the third bullet point says: "the analysis cannot prove statistically whether a buyer is a foreigner or local". It's in the same story, literally on the same page, not a different page. I wrote those bullet points. No one actually looked at them. And this was my process of learning how journalism approaches these stories: that will not sell papers. The headline "the analysis cannot prove statistically whether a buyer is a foreigner or local" is not going to sell a lot of papers. What is going to sell a lot of papers is sort of
running with that. But the interesting thing was how this was perceived. At the time, Labour gave the story to The Nation as well, and the Herald published it as well, and there was a real divide which I faced in the newsroom. I was talking to all these people outside and they were like, this is complete bullshit, and most journalists I was talking to were like, what a great story.

And that is the harder challenge: if journalism continues to operate in a vacuum, and we don't embrace data analysis and an open way of sharing data and analysis, then we're going to struggle. It's kind of hard to tell the truth, but what I found out with this story, and subsequently over the last three years, is that sometimes it's harder to not publish a lie. We all know about this from last year: what is sensationalist and not actually true, but will get people clicking through the website. This is going to be interesting as the media landscape evolves, and whether we embrace data journalism and different ways of storytelling and different ways of showing how the story was actually formed.

I'll move away from this slide. So really the problem is not that journalists don't cite quantitative evidence. They all do; that's what legitimizes a story. If you look at a news story, the third or fourth paragraph will have some numbers, and then there's no further mention of the numbers. But it's done in a way that can be anecdotal and ad hoc rather than rigorous and empirical. And it depends on what questions you ask of the data. If you're not asking questions of the data, which makes your stories dependent on someone else's analysis, then you're really doing a disservice to your readers. So this is the note that I would like to end on. This quote is from Nate Silver, and I'd really recommend you check out his website and his blog. Thank you.

So Alberto Cairo, he's the head of
visual journalism at the University of Miami, and he's written two books which are really worth reading, The Truthful Art and The Functional Art. He teaches visual journalism and he's quite active on Twitter as well. What's really interesting about Alberto is that, like most data journalists, he's kind of self-taught. He's taught himself statistics and all this stuff, and he's quite passionate about learning. I've linked to his blog in my slides as well.

So firstly, it's your data analysis process. Sometimes you get really excited when you find something. In the interactive I did on internals versus externals, I found this huge drop in unit standards, and I was like, oh yes, I found something. The real check is going and talking to an expert in that area, so that process of journalism still applies. You go and talk to a domain expert, and they're like, oh, we changed our policy in 2009 and that's why you're seeing the drop. I was so devastated when I found that out.

So the question is whether I've worked with grade boundaries, grade score boundaries. Unfortunately, a lot of education data is not available. You would have to make an Official Information Act request to get the data, and you won't be able to get some of it for privacy reasons as well. I would love to have a look at the data, and I think it should be made available. I'm really puzzled why education data is such a sensitive topic that we don't actually know what's happening in NCEA in different subjects, and the minister comes out and says pass rates are rising. I don't think they won't.

The first project was really interesting, where I did the internals thing. I had just joined the Herald. I think they released the data set at the time thinking that the journalists would be overwhelmed by it and not know what to do, and that did turn out to be the case. But I think they've become wiser to the fact that
there are people out there who can work with these data sets in the newsroom, and you don't generally get people happily giving you that kind of data.

So mostly Statistics New Zealand or government websites. As a journalist you want to try and get to the source of the data, so you can make an Official Information Act request to different departments and ask them if they have a certain data set. In most cases I would try and get the data from the government department, because there are some data sets where we don't actually know that they exist, because there is no public information about them. With the NCEA data set, I'm gradually learning what actually exists by looking at their different spreadsheets, and if I was to do an OIA request next, it would be informed by that. So it's kind of a fishing exercise.

Okay, well, we'd like to thank Harkanwal for a very interesting, informative lecture. People who work in other areas of statistics will recognise the conflict between statisticians and their superiors, who don't necessarily know or care so much about the fine details. So Harkanwal, a token of our approval, and let's thank him again.