Tēnā koutou katoa. Today, as Fee said, I am going to be discussing my experiences during my PhD building a large corpus based on the official online version of the New Zealand Parliamentary Debates, or Hansard, as well as giving an insight into the narrow institutional politics of Parliament and political elites. As a researcher, I'm interested in the parliamentary record as a lens on society, culture, history and ideas. Why might this talk be relevant to folk in GLAM? For those involved in research on their collections or working closely with researchers, you might be interested in the corpus itself, but maybe more so in the methods that I've used. For people involved in putting collections and content online, you might be interested in some of the hoops that researchers have to jump through in repurposing digital resources, and take this as encouragement to continue engaging with the research community in this work. I'll explain why there's a radio in a bit. So first I'd like to say thanks to NDF for having me, and a big thanks to the University of Canterbury for funding my PhD research with a scholarship, to my supervisors, Bronwyn Hayward in political science and Kevin Watson in linguistics, and to my lovely colleagues at the Arts Digital Lab. Representatives from the lab talked yesterday about the project Understanding Place; the lab works closely across the humanities, social sciences and fine arts at UC and has strong links to the GLAM sector. A bit about myself: I have a professional background dating back to 2000 as a software developer, working with web-based applications and web technologies at a development shop in Christchurch. My work on Hansard was part of my PhD research, and in a postdoc position I'm going to be continuing to develop this and apply digital methods for new research.
That's a bit about myself. Now, for the idea of speech as data, I'm going to get a Green MP to introduce it, so hopefully the audio works. Basically this is a speech by Kevin Hague in the House of Representatives, and he quantifies the use of the words economy, growth and business in speeches by the National Party over time. He makes the point that John Key isn't using the words climate change, isn't using the word poverty, and I want to use this as a way to introduce the idea of using speech as data: quantifying it, counting it. Oddly enough, in my research I was studying the use of the word economy, and I was applying methods developed in corpus linguistics. Corpus linguistics is, according to the classic definition, the study of language based on examples of real-life language use. A corpus is a collection of a lot of texts, examples of language use collected together in a standardised format (there are no real stipulations about that), in a way that can be processed by a computer for analysis, and maybe enriched with further annotation. A basic claim of corpus linguistics, and of a lot of digital methods, is that it's a way to surpass your intuitions about the data you're dealing with and find meaningful patterns within it. So I was interested in pervasive patterns of use of a word like economy, and in placing it in perspective by drawing comparisons. A bit about the PhD research that this was part of: it combined political psychology and corpus linguistics to examine the use of the word economy in the wild. And by wild I mean parliamentary debates; I also built a large corpus of 1,788 hours of talkback radio. I didn't listen to all of that, but I had some software listen to it.
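The basic move Hague makes, counting a word's occurrences across a body of speech, can be sketched in a few lines. This is a hypothetical illustration; the utterances and speakers are invented for the example, not drawn from the corpus:

```python
from collections import Counter
import re

# Invented (speaker, text) pairs standing in for a real corpus.
utterances = [
    ("Key", "The economy is growing and the economy is strong."),
    ("Hague", "We cannot put the economy ahead of climate change."),
    ("Key", "Our plan will grow the economy."),
]

def count_word(word, texts):
    """Count whole-word, case-insensitive occurrences of `word` per speaker."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    counts = Counter()
    for speaker, text in texts:
        counts[speaker] += len(pattern.findall(text))
    return counts

print(count_word("economy", utterances))
# Counter({'Key': 3, 'Hague': 1})
```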
There's a body of academic research that's concerned with lay economic thinking. Past research has often been conducted by economists and economic psychologists, who focus on demonstrating the ways in which lay people deviate from the judgements and knowledge of economists, and they typically see that as a problem. Or, on the other hand (this tends to be the psychologists), they use interviews and surveys to attempt to uncover hidden structures within our cognitions about the economic sphere. Instead, I was focusing on observable language in these contexts, connecting public thinking about the economy to the context of political arguments and ideas, and studying common features of argumentation to provide new insights on lay thinking as a kind of public thinking. So I'll talk briefly about the corpus. Here's a photo of my son reading a volume, and next to him there's a wall in the political science department at Canterbury holding an almost complete set of the print volumes. Racing forward to the present: obviously we have an online version of this, and the current parliamentary website contains debates from 2003. At the time I started looking at the online Hansard, in 2013, it was quite slow to search and really wasn't conducive to cutting up in the kind of way I wanted for my analysis. So my first instinct was to look at how it could be repurposed. Here I've got a representation of the source HTML, and within it you can see a bunch of markup: markers for particular kinds of speeches, indications of timing, even the page a passage would relate to in a print volume. The speakers are represented in a common way, so I thought this would be conducive to cutting up in some way. I'm not going to go into a lot of detail about that, but essentially the way a lot of people would approach this is to use an XML parser and use that as the groundwork. However, there was a lot of inconsistency in the way the markup was represented, and it wasn't well formed in places.
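As a rough illustration of coding around markup that isn't reliably well formed, here's a sketch that pulls speaker/text pairs out of a fabricated, Hansard-like snippet using a tolerant regular expression rather than a strict XML parser. The class names, structure and speeches below are invented for the example; the real Hansard markup differs:

```python
import re

# A fabricated snippet loosely imitating the kind of markup described.
# Note the last div is deliberately left unclosed: not well-formed XML.
html = """
<div class="Speech"><strong>Hon JOHN KEY (Prime Minister):</strong>
 I move that the House take note of the economy.</div>
<div class="Speech"><strong>KEVIN HAGUE (Green):</strong>
 The word missing from that speech is poverty.
"""

# Because the source isn't reliably well formed, a tolerant regex pass
# can be more robust here than a strict parser that rejects bad input.
speech_re = re.compile(
    r'<div class="Speech"><strong>(?P<speaker>[^<]+?):?</strong>'
    r'(?P<text>.*?)(?=<div class="Speech">|\Z)',
    re.DOTALL,
)

speeches = [
    (m["speaker"].strip(), re.sub(r"<[^>]+>|\s+", " ", m["text"]).strip())
    for m in speech_re.finditer(html)
]
for speaker, text in speeches:
    print(speaker, "->", text)
```

The trade-off is that a regex scanner keeps going past malformed spans where an XML parser would stop, at the cost of needing manual checks that nothing was silently skipped.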
I hope no-one from Parliament's here. Obviously, over the 10-year period I was working with in 2013, things changed over time, including the way things were being marked up. There was a period of a week where it looked like their system went down and they were cutting and pasting from Word documents or something like that. You find all this stuff out when you scrape. At the end of the software-driven process to cut this thing up, there was a database containing all the separate utterances and procedural text, coded with the debate they were from, the type of debate, the date of the debate, who was speaking and their party, and there was an additional stage of processing to make searches and comparisons very fast. I updated it in 2016, and this is what I used for my analysis. The corpus covered a 13-year period: 57 million words, almost 400,000 utterances and 261 speakers represented. So I built a tool that went along with this. This is what I was hoping. Okay: similar to what Tim was saying, this is a rough working thing; I developed it enough to do my analysis, but something I want to do through the postdoc is release a public version for the public and researchers. So I'll demo some of the basic browsing. I thought I'd pick on librarians. This is the classic corpus linguistics tool: when you search, the keyword sits in context down the middle and you see some text on either side of it, and this is the basic interpretive tool; even if you're using high-level quantifications, you're still coming back to this to check the robustness of your interpretations. So I'll pick out one that I quite liked. You see, when you click one of these lines, it races to the point in the debate. "May the rage of a thousand librarians rain down on the heads of the National Party members." I'm sure there's probably some people in this room that would agree with that. Sorry.
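The concordance view described here, keyword in context (KWIC), is straightforward to sketch. This is a minimal, hypothetical version over invented utterances, not the actual tool:

```python
import re

# Toy utterances standing in for the corpus; the real data lived in a database.
utterances = [
    "May the rage of a thousand librarians rain down on the heads of the members.",
    "Librarians know the value of the record better than anyone.",
]

def kwic(keyword, texts, width=30):
    """Return keyword-in-context lines: left context, keyword, right context."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            # Right-align the left context so keywords line up down the middle.
            lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

for line in kwic("librarians", utterances):
    print(line)
```

Aligning the matches in a column is what makes patterns of usage on either side of the keyword jump out when scanning hundreds of lines.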
Part of the tool is that it allows you to expand the context, to see things in the context of a debate. In question time interactions you can see the sequence of turns and what people are responding to, and this was a basic move in the kind of corpus work I was doing: moving between levels of analysis, from the big quantifications down to the actual utterance level. One other thing: I've made it so that you can highlight specific text in the document and it'll create the concordance for that. So, kind of similar to what Tim's doing (he's annotating the text with linked open data), this is a way of annotating the text with itself, seeing context across the corpus, across lots of examples of usage. At this point it's good to acknowledge that Tim's actually done some work on this: he's built a tool for 80 years of the Australian Parliament, and there are other people in the UK, the Hansard at Huddersfield project, trying to make this more useful for researchers and more accessible for the public. What can researchers learn? What can we do with this? I'm going to take the case of economy and talk about a small part of my research. Firstly, in terms of use of economy, we can quantify its use over time and compare it. Here I've got a graph showing the average patterns of use by three different parties, aggregated over different parliamentary terms and expressed as a rate per thousand words so that you can directly compare them. We can reflect back on what Kevin Hague was saying, in the clip I couldn't play: he was talking about the use of economy by the National Party, and obviously in the period from 2008 there's a financial crisis and a change of government. It is a very noticeable feature: in comparison to the other parties, National's use increases dramatically, but interestingly, before this 2008 period National comes behind Labour and the Greens in their use.
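The normalisation mentioned here, expressing raw counts as a rate per thousand words so that parties producing very different volumes of speech can be compared directly, looks like this in outline. The figures below are invented for illustration, not the real corpus numbers:

```python
# Hypothetical hit counts and word totals per party, invented for the example.
party_stats = {
    "National": {"economy_hits": 5200, "total_words": 2_000_000},
    "Labour":   {"economy_hits": 3900, "total_words": 1_800_000},
    "Green":    {"economy_hits": 1100, "total_words":   600_000},
}

def per_thousand(hits, total_words):
    """Normalise a raw count to a rate per 1,000 words, so speakers or
    parties of different sizes can be compared on the same scale."""
    return hits / total_words * 1000

for party, s in party_stats.items():
    rate = per_thousand(s["economy_hits"], s["total_words"])
    print(f"{party}: {rate:.2f} per 1,000 words")
```

Without this step a party that simply talks more would appear to care more about every topic, which is exactly the distortion the per-thousand-words rate removes.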
In addition to this example from my research, we can look at collocations, words that predictably occur in a limited span around economy. Here I'm showing words that collocate with it across sentences, across the whole corpus, so you might expect words to do with growth. It gets more interesting when you compare parties on this. You'll notice that for the two major parties I've highlighted the common words between them, particularly the top five. I've talked about a shared vocabulary, a shared language, related to the economy, and also the dominance of the idea of growth, and we're able to quantify this assumption that's prevalent in this kind of speech: National Party members said growth, or one of the other grow- words, once in every four sentences where they mentioned economy; Labour members used grow- words once in every six such sentences; the Greens, in contrast, mentioned them once every 20 sentences, and this was often in a critical sense. In addition, I was quite interested in the word our, because this is our economy. The academic literature has focused on the emergence of the idea of the economy as a kind of separate, abstract, independent entity that we do things for in our politics, but along with this abstract thing that politicians were appealing to, they were also appealing to our embeddedness, the embeddedness of the economy: it's for people in the end. So I've mainly talked about language, but, moving away from the economy, there are other things you can take from this data; graphs of this kind can't be produced otherwise, as there's no data to do this kind of stuff at the moment. This is looking at Green Party votes across years and the percentage of yes votes, no votes and abstentions, and you'll see that under Labour they tend to vote yes more, and under National they vote no more, but there's still this kind of strong no voting under Labour, which is quite interesting given they're in coalition now. Also, we often talk about representation in Parliament and equate that with who gets to speak and who
has a voice. Using this data, we can look at who actually gets to talk, across parties, across genders and so on. Here I've picked out a party that I've written a bit about, the Greens, comparing 2006, where there's quite an equal share of speaker allocation, to 2012, where there seems to be more of a hierarchy, and these are patterns that aren't easily inferable from just listening to Parliament. Okay, so what are the opportunities? The key opportunity, I think, is to extend the span of the corpus. I stopped in 2016 when my analysis ended, but obviously there's this interesting period recently with a change of government. There are other sources of Hansard now, going back to the earliest volumes: there are scans made available by the HathiTrust, there's a mirror of the previous online records prior to 2003, and I've recently noticed some Google Docs linked from the wiki that look kind of official as well, though I'm not sure where they come from. In addition to this, there are videos that could enrich the way people interact with this kind of tool. Joining these sources up and putting them into a comparable format across the entire Hansard record would have lots of research applications, and it's a challenge I've been looking at for a while. Prior to the HathiTrust scans being available, I piloted scanning and OCRing the debates myself, and thankfully this isn't something I have to worry about anymore. But I'll talk about the challenges. First, there's a basic problem of changing reporting practices; this is a problem anyone working with Hansard needs to be aware of: there are transcription and editing practices that change over time, and that's something researchers don't really know a lot about, especially once we get further back in time. Second, with different data sources there are different representation practices, even just down to how many columns there are in the print version, whether macrons were being used or, in one case, removed from one
record source, and how we might identify speakers and procedural text within the OCR record itself; I've trialled using Bayesian classifiers to do this, with some success. Thirdly, there are just some basic errors, both bad scans in the record and bugs in the online record, which, as I mentioned, you probably only notice if you scrape it. Why should you care? To put this another way, what can librarians, archivists and others in cultural heritage learn from this? Part of my hope with this talk is to spark some interest in this form of analysis. I'm aware that there are university librarians at NDF, and I would love to hear from researchers and others interested in this form of analysis with Hansard or other sources. I'm also keen to provoke dialogue with people interested in the possibilities of collaborating towards joining up the parliamentary record. I have a rough idea of a wiki interface to crowdsource improvements to the scans, to find specific pages that are problematic and to correct the texts of significant speeches; perhaps we could discuss this more in question time. Also, I think it would be useful for folk in GLAM to understand some of the problems that researchers have in repurposing data in the way I have, especially as some of you will be involved with decisions about making stuff available. Many of you will be thinking, well, the answer is an API, and if we build an API the researchers will come. This is something I've been dealing with via the DigitalNZ API recently, which is great, but it's interesting comparing institutions: the quality differs. I was comparing, for example, dissertation records across university repositories; six out of eight of these institutions said what degree the dissertation was related to, two don't. That kind of thing leads me, as someone who can code, to start scraping that from the repositories myself. Beyond the whole build-it-and-they-will-come idea of APIs, researchers need documentation on how the records were created and
the decisions involved in that. These are things you'll know, but it's something that can probably be helped by continuing dialogue with the research community, and this is perhaps an encouragement to continue those engagements. So I'll leave it there. I had something else to play, but that'll do.

We've got time for one or two questions. Any questions out there? Tim's got one, that's great.

Jeff, my question is sort of obvious, and that's the possibilities of international comparisons, because we have the Australian Hansard, even though there are some licensing issues around it, and I think the Canadian Hansard is available now. So I'm wondering what you think might emerge if we can start to actually explore these different bodies of political speech across countries.

Yeah, I think that would be interesting. The economy is interesting from my own research, but I think it would be worth seeing this in a comparative context. One other thing: I've looked at the Australian records and played around with them, and there are some problems in comparing formats, but some people are already doing this kind of stuff; I'm aware of some Canadian researchers that did work with European parliaments. There's a wide-open field really in terms of these comparisons.

Kia ora Jeffrey. You mentioned problems with documentation and the source and the markup. Did you think of contacting the Office of the Clerk and asking them about the inconsistencies?

I did think of that, but I'd kind of done it already, and this is the hacker thing: just do it, solve the problems as much as you can yourself. I'd really like to talk to the Hansard team about some of the things I've found with this; I've found one little bug that I'm sure they'd love to know about. It's also a big reflection on their work: when you work with this for a number of years, you see the
amount of work that's going on to put this thing together.

Thanks, Jeff. Rowan Payne, DigitalNZ. Just to follow on from that previous question, I'd put it out there that with the issues you found through the DigitalNZ API, we can also work with that.

I appreciate that; that wasn't a criticism at all. I sent an email about something else and got a nice prompt reply, so that's cool; it's the nature of the thing. My point is really that an API is about what data you're exposing and how, and researchers shouldn't really be constrained by the particular ways of classifying those records. I've got the privilege of being able to code around it, but I guess it speaks to how the APIs are designed and what's being exposed.

Hi, I was just wondering how far you can filter it. You can do it by parties; could you do it by individual MPs and analyse their specific language and use of words?

Yes, yeah. I've done a little bit of work with this, with John Key and direct comparisons with Helen Clark, so there's a particular kind of speech that belongs to prime ministerial roles, and then it's working out what's different. This is something I think you can do with the Hansard search now, bring up individual speeches by parliamentarians, but at the time I kicked this off you couldn't, and just being able to see all the speeches, decontextualised across time, one speaker's speech after another, is quite useful. Is anyone from Hansard or the Parliament here?
Oh sweet, I'll talk to you afterwards.

OK, I was going to ask exactly the same question, because I had an inkling there was probably someone from the Parliamentary Library or from Parliamentary Services in the room, or at least at NDF. I think by the number of questions you can tell that you've got a pretty engaged audience, and I can think of several people, some of whom are in the room, who you need to talk to, so hit me up afterwards. Really exciting work, and there's some people for you to meet. OK everyone, it's lunchtime. Can we homai te pakipaki for Jeff?