Okay, so let's get underway. Many of you have probably already met Lauren Gawne, who joined us a couple of months ago as an ELDP post-doctoral researcher. Lauren's not going to talk about the project that she's working on today, she's going to give us a chance to hear about that at a later stage, but she's working on a minority Tibetan language called Kagate, which is spoken in Nepal. She's done some really interesting work on evidentiality in particular in this language. She came to us after having spent some time as a post-doctoral researcher at Nanyang Technological University, which is based in Singapore, and while she was there she was actually able to do some field work on Kagate and to collect some data in advance of her project here. Her PhD she completed two years ago at the University of Melbourne in Australia, where she worked on Yolmo, which is another minority Tibetan language. She's done extensive field work on Yolmo and has written up analytical material on that in her PhD, which we hope will be published before too long. There is a grammar forthcoming; there is a grammar coming out. She's been extremely active, or rather she is extremely active, on the social media side of things. With another colleague she edits the Superlinguo blog, and she tweets all the time. We haven't managed to convince her to get onto Facebook yet, but I've been working on that, and as you can see the slides here are available, with a nice little link to where you can get your hands on them. She also has been importing Australian cultural activities into the UK, and as you all know, and as a recent article about Australian pronunciation showed, there is a very close association between Australians and alcohol. So Lauren has been behind the setting up of Linguistics in the Pub in London. 
It of course doesn't have to involve alcohol, but it involves convivial discussions about linguistic issues, and there will be one coming up next month, I think on the 8th of December. So Lauren is very active, very engaged, and today she's going to tell us about some collaborative work that she's been doing with colleagues looking at issues of citation in linguistics. It's a real methodological question that she's addressing today. Really sexy topic for a Tuesday: data citation and methodological standards in the genres of linguistics. Thank you very much, Peter, for that intro. There's not enough people and I feel very far away from you behind the lectern, so I'm just going to roll my way out here. I'm going to have to shuffle back occasionally to make sure I'm following my notes as I wander around. You can view these slides if you're online right now; they do look a bit better on a smaller computer screen than they do up here. Because I'm new here, a lot of you don't necessarily know the work that I'm doing, and it will also hopefully help you understand my motivation for today's talk if I give you a brief background on the part of the world that I work in and the kind of research I've been doing lately. So, as Peter mentioned, I work with Yolmo and Kagate, which are two languages within the same family: there's this collection of Yolmo varieties, and there's Kagate, spread across Nepal. My PhD mainly focused on evidentiality; I'm taking a new tack from this point on, but I'm always happy to talk basically endlessly about evidentiality as well. This map looks a lot better if you view the slides online; you can zoom in and play around. 
I like digital map making, and this is a chance to gratuitously plug the mapping workshop that will be running on the 14th of December. With the Plants Animals Words team I've put together a day where you get to learn how to collect ethnobotanical samples and information, and then you get to map it really, really pretty, because I think the pretty is what's often missing from digital maps. Within the work on Kagate we have two main foci within the project funded by ELDP. The first is to build a corpus of Kagate language use, and we're focusing specifically on traditional knowledge and traditional folk tales. This is partly so that the corpus is useful both for the community, as a repository of cultural knowledge, and for linguists who might be interested in the linguistic features of the language. But I always like to think of my work as having three audiences. We have the linguistic community: you guys get these super fun technical papers, like today's paper. The community: we like to try and return useful things to them; whether or not they use them is a different matter. And then the general public. Focusing on the community and the general public, I've been working with arts colleagues in Singapore on a project where we take the traditional stories and turn them into pretty picture books, posters, online media, those kinds of things. And these are some amazing illustrations, by one of our graduate students, that this slide does not do justice to; they look fabulous. We recorded this traditional story, 'The Jackal and the Old Woman', so we see the jackal there and the old woman. It's a slightly comedic tale that involves death, because they have a very grim sense of comedy. The other thing I'm focusing on is the use of gesture in discourse. I'm particularly just mentioning this slide in case any of you happen to work on languages of South Asia or Southeast Asia and you recognize this gesture, which, if it's done with a shrug, kind of has a 'what are you going to do about it' effect; if you do it with a head nod it kind of means 'what are you up to'. It has a vaguely rhetorical sense, but it is also used with speech. I've just put a little gif of one of my speakers here, because I'm not really going to analyze it too much today, but if you are familiar with this gesture from the India-Nepal area and you have some data, I will happily chat to you. That's the plug for my gesture stuff. Today, though, I'm going to talk about a project I've been doing with a couple of colleagues: Andrea Berez-Kroeker and Tyler Heston at the University of Hawaii, and Barbara Kelly, who was my PhD supervisor and is still stuck with me, at the University of Melbourne. That is obviously not Tyler; that is Jack Lord from the original Hawaii Five-O. Tyler was one of Andrea's graduate students and is a notoriously difficult guy to pin down online, so that is his current alias for you all there. This project came about because we all work in documentation and description, so the data that I'm going to talk you through today is heavily biased towards a focus on that area, because it's the area we all work in, and we're deeply interested not only in doing this work but in thinking about how we do it and why we do it, and the kinds of outputs that we might create from our research. We're currently writing this paper up for submission very soon, so whatever you think about today's paper, you can ask questions and I'll try and address them either now or in a later written publication. 
So I'm going to make a kind of bold statement, and I just hope that you're all willing to go with me on this today. In the sciences, particularly what we think of as the hard sciences, we have this idea that claims should be falsifiable and reproducible: if you're going to make a claim about the state of the world, other people should be able to assess that claim based on the evidence to hand, in order to be able to come to a similar or potentially different conclusion. And I'd like to think that in linguistics, if we treat linguistics as an empirical science, we value that kind of reproducibility too. If we look at the field of language documentation in particular, this has been discussed more and more frequently in the last couple of decades, although it's not just constrained to the last couple of decades; there are some wonderful quotes from Malinowski about the benefits of recording linguistic information to analyze it more empirically. So it's something linguists have long thought about. There is a lot of discussion about reproducibility and replicability, and I'm not going to get too deep into the semantics of these, but generally reproducibility is about access to other people's data to make analyses and see if you draw similar or different conclusions, and replicability is the ability to recreate the entire experiment. Obviously language documentation, and a lot of other fields of linguistics, make it very, very hard to replicate an entire data collection from beginning to end, especially for something like discourse analysis that's very context-dependent in terms of the information that you're analyzing. In terms of how we do the work that we do, looking specifically at language documentation, there is a wealth of great information out there explaining how to do field work and how to collect linguistic data, and an increasing genre as well of how to write grammars. 
So Gippert et al. and the Nakayama and Rice LD&C special that came out, as well as many articles over the years in LD&C (Language Documentation & Conservation) and the Language Documentation and Description volumes, have talked about these topics. So we have a rich literature on how to do the kind of work that we do in collecting and writing up language data, but very few of these explicitly discuss in particular detail things like data management, citation and attribution of linguistic data. There are obvious notable exceptions to this. However, on the whole we're very much focused on how we collect the data, but not necessarily on how we then go about sharing that data or displaying that data to other people. And part of that is the history of the genres that we work in. So when we think about the Boasian history of language documentation, with the focus on a trilogy of grammar, dictionary and text, implicit in that is the idea that the underlying text is somehow separated from the analytical description. And it's important to remember that grammar writing is not an atheoretical pursuit. I think there are some people who sometimes feel that it's kind of the thing you do, and then it's up to typologists, or those who have a particular syntactic theory, to come along and use that data to form their theories. But the way that we present our linguistic data is actually making a big claim about what we think language is and how we can analyse it. And so these old habits can be hard to break. So even though we have a lot of really great literature on how we do language documentation and linguistic description work, we don't necessarily have a habit of writing explicitly about how we do that ourselves. And so we came to this project with some fairly open-ended questions. 
Before we can think about what we might want to aspire to, or what we might want linguistic science to look like in terms of citation and reproducibility, we wanted to take the opportunity to look at what the current state of the art is. And so we decided to conduct a somewhat large-scale survey of how things are going. For our study, we examined 100 books: 50 published grammars and 50 dissertations. It was a sample of convenience, and I'll talk about that briefly, but we were trying to get a range of authors from a range of institutions, countries, languages of focus and publishers, trying to distribute that kind of variation. And we looked at 271 journal articles across nine journals. I'll discuss what journals we looked at and why briefly, but we had about 30 from each of those journals. The time period we chose was 2003-2012, and that was deliberately based on the fact that the Himmelmann 1998 paper is often taken as a fairly seminal work in terms of positing language documentation as a pursuit worth pursuing in and of itself, and thinking about the data that we collect from that as an object that's worthy of explicit discussion in and of itself. So we figured that at least five years after that, everyone's taken on board the lessons from Himmelmann 1998, if they were going to take them on at all. So, just to briefly discuss the journals: we chose nine journals overall. There are four areal journals. I think it's important to state that even though they may sound a lot like they're just a similar genre to descriptive monographs and grammars, a lot of the articles do have a really strong theoretical or typological point to make. So even though they're areal, and a lot of the people who write for these journals are primarily language documentation and description people, the focus of these articles was generally theoretical or typological. 
We had two targeted subfields in sociolinguistics and second language acquisition. We had some divergent theoretical persuasions, to try and see if there was any variation there: so we have Natural Language & Linguistic Theory and Studies in Language. And we put Language in there as well, just because it is one of the top journals in our field across all genres of linguistic study. I've put the distribution by year of publication up there, not necessarily because it is in and of itself that important, but because we had a much larger range of readily accessible journal articles: we could pull every article for the ten-year period for each journal, and then we randomly selected 30 articles per journal across the years. It's important to note we had a lot of trouble with LTBA, and for this I feel particularly sad, because it is the Linguistics of the Tibeto-Burman Area journal, so it's kind of my journal. We couldn't get articles from a certain number of years, because LTBA changed publisher and all the articles from the previous publisher only exist physically for now. Hopefully that's being rectified, but it's a bit of a tragic state of affairs. For the books we had to go for a sample that was much more based on convenience, in terms of what we could access. It also just turns out 2009 was a rubbish year for publishing traditional books. So, published grammars are in red today and dissertations are in blue. The dissertations also skew towards the end of the decade; that's partly because people are getting better at putting dissertations online, but generally dissertations are really, really hard to get hold of. If you are writing a thesis, please work out something with your institution where, even if it's embargoed for a little while, you eventually make it public, because there are so many amazing-sounding dissertations in the world (potentially; I don't know, I can only read the title and the abstract) that basically do not exist. 
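The per-journal sampling step described above (randomly selecting 30 articles per journal from everything published in the ten-year window) can be sketched roughly as follows. The journal names and article counts here are invented placeholders, not the study's actual figures; only the selection logic is the point.

```python
import random

# Sketch of the sampling design: for each journal, randomly pick 30
# articles from all those published in the 2003-2012 window.
# Journal names and pool sizes here are invented placeholders.
random.seed(0)  # fixed seed so the selection is repeatable

journals = {
    "Journal A": 112,  # total articles available in the window
    "Journal B": 87,
    "Journal C": 204,
}

def sample_articles(pool_size, n=30):
    """Randomly select n distinct article indices from a pool."""
    return sorted(random.sample(range(pool_size), min(n, pool_size)))

sample = {name: sample_articles(size) for name, size in journals.items()}
```

Fixing the seed just makes the selection repeatable, which is very much in the spirit of the reproducibility theme of the talk.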
If your dissertation is only on paper, in a repository, and it's hard to access, then, yeah, that's my rant for today: if you're going to do work, think about how other people might want to actually read it. In terms of data coding, we had two broad areas, and I'll talk through each of these in turn and keep coming back to the slides so you remember what they are. The first is looking at methodology, so we're interested in how people talk about the work that they did. We wanted to look at things like whether people talked about the participants in the research, the people who participated in the data-giving process, I guess. We also wanted to look at data collection equipment and data collection tools. By equipment we mean things like recording devices, if any were used. By data collection tools we mean any kind of elicitation stimulus, or surveys, or experimental structures that were conducted. Data analysis tools and software is the post-collection analysis stage: if you use Praat to analyze phonetic data, or if you're interlinearizing using Toolbox or FLEx, or using R for your statistics, mentioning that. Time spent collecting data is obviously going to vary depending on the linguistic genre, so this might be the amount of time you spent in the field, or it might be how long an interview was conducted for, but it's talking about the kinds of time frames that you were working with. And then there's looking at the type of linguistic genres. We then had some data-related variables. I'm going to talk about the methodological variables all together, so I've just listed them there, but I'll work through the data-related variables one by one. So these are: the source of data, so where does the data come from? For the journals we broke down all of the genres that were analyzed, so I'll talk about that in relation to journals. 
Then there's looking at where the data is now, so whether it exists outside of the publication, and the citation conventions used to reference the data, if any; and that 'if any' is somewhat important, because what we find is that this is not something that is necessarily commonly done in all genres of linguistics. So today I'm going to share the findings here, but I hope it's also a chance for you to reflect on linguistic methodology and data citation practices in your own work, in your own subfield of linguistics, and how it might relate to the findings here. So, looking at these methodological variables first. For journals we simply had a binary question, looking at whether there was some kind of description of data collection methods, and we set our benchmark pretty low here, because journals have really strict space constraints and they often have a very specific focus, so even if your methodological discussion was a brief passing reference in a footnote, you got a yes from us. And we find really strong variation by journal. So Studies in Second Language Acquisition, which I think other data that I present today will make clear, is a strongly experimentally focused journal: almost everything that's published in Studies in Second Language Acquisition uses some kind of test or experimental instrument to test a hypothesis about language acquisition, and because they're working in an experimental framework, the genre overtly demands a methodological section, often overtly marked 'Methodology' or something to that effect. Other areas of linguistics don't have the same genre expectations, and so we don't find as consistent or clear discussion of methodology. On average, I think largely helped by the second language acquisition journal there, 40% of articles will overtly discuss methodology, but there's some really strong journal variation going on there. 
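The binary yes/no coding just described, and the per-journal share of 'yes' articles, amounts to something like the following; the journal labels and yes/no records are made up for illustration, not figures from the survey.

```python
from collections import defaultdict

# Each article is coded yes/no for "discusses data collection methods
# at all" (the deliberately low bar described above), then we compute
# the share of "yes" per journal. These records are invented examples.
records = [
    ("SSLA", True), ("SSLA", True), ("SSLA", True),
    ("Journal X", True), ("Journal X", False),
    ("Journal Y", False), ("Journal Y", False),
]

counts = defaultdict(lambda: [0, 0])  # journal -> [yes_count, total]
for journal, discusses_methods in records:
    counts[journal][1] += 1
    if discusses_methods:
        counts[journal][0] += 1

shares = {j: yes / total for j, (yes, total) in counts.items()}
```

Even with a coding scheme this simple, the per-journal shares fall out directly, which is how a figure like the 40% average (with strong journal variation around it) gets computed.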
In terms of the kinds of features that were discussed, we decided to condense data collection tools and equipment into one category for journals, just because these were categories that we originally created for the monographs and it didn't seem necessary to expand so greatly for journals; and also, as I said, the genres collected we broke out much more qualitatively later on, so I'll talk about that further down. There's a very strong focus on participants, with a decreasing methodological focus from there. So there are very strong genre differences across journals, but overall, if they do talk about anything, they at least mention participants. Again, we set the bar pretty low there: even just naming a few participants in your acknowledgments for a journal we decided was sufficient for this category. I'm lumping documentation dissertations and descriptive monographs together under the title of 'books', but the books I'm talking about are very descriptive-linguistics focused. I've broken it down by year, not necessarily because that's very important, but to show that consistently dissertations outperform published monographs in terms of how many categories they cover. So, instead of simply asking, as for journals, do they have a methodology, yes or no, for published monographs we decided to use a Likert rating from one to five as to how extensive the methodology was. But what we found is that that correlated pretty neatly with how many of our six features they mentioned. So a more comprehensive methodology, unsurprisingly, mentioned more of the categories, and it's slightly more objective for us to talk about how many of these six methodological features they overtly discussed. I should mention that the features I'm discussing today came organically out of what we found were the features most often discussed in the documentation and description literature, and then comparing that against an initial sub-sample of ten of the grammars. 
And we felt that that was a minimal set of expectations we could ask of a rather good methodology. I'll talk about that more towards the end; I don't think it's necessarily the only set of things that we need to include in a methodology. But these are the six we focused on, and as you can see, people generally managed, especially with dissertations, to talk about four of those six categories. We have two hypotheses about why dissertations outperform published grammars. I think the most likely one is simply that if you wrote a dissertation between 2003 and 2012, that's probably based on data that you collected within about seven years of that. So it's potentially relatively recent work, and PhD training is giving you access to the contemporary literature, which you can put straight into practice. Whereas if you have a published monograph in this ten-year period, that may be based on 10 to 20 years of field work. For example, I know Carol Genetti's grammar of Dolakha Newar, which I often refer to, was published in 2007, but she'd been working on that for many, many years. So if you're starting from scratch and you're reading the literature about making your methodology clear, and you've read Peter's article on meta-documentation and you include that in your description, you're much more likely to do these things. Carol's grammar, incidentally, did do very well in our ratings, so it's not necessarily just about how long you've been writing a grammar for. The other theory is that potentially published grammars and dissertations are two slightly different genres, and because in a dissertation you're attempting to demonstrate that you really do know everything, and that you are very diligent as a researcher, you may be more likely to go above and beyond in including these methodological features. But either way, the students are outperforming their teachers; that's not a bad thing, I think. 
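The observation that the one-to-five Likert rating 'correlated pretty neatly' with the count of features mentioned can be illustrated with a small Pearson correlation over paired scores. The scores below are invented purely to show the check; they are not the survey's actual ratings.

```python
# Does a 1-5 rating of how extensive a methodology is track the count
# (0-6) of coded features it mentions? These paired scores are
# invented to illustrate the check, not taken from the survey.
ratings  = [1, 2, 2, 3, 4, 4, 5, 5]  # Likert rating per book
features = [0, 1, 2, 3, 4, 5, 5, 6]  # how many of the six features

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(ratings, features)
```

A value of r near 1 is what licenses switching from the more subjective Likert rating to the more objective feature count as the reported measure.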
In terms of the features discussed, you'll see that some features are more frequently discussed than others. So for genres, all we asked is that people overtly, critically considered the range of genres that they had in their dissertation, and a lot of people did that: we had 50 grammars and 50 dissertations, and that's 42 there, so almost every one of the dissertations discussed that topic, whereas for some other topics people are much less likely to discuss them. So methodologically we have some very, very strong performers, but on the whole it's not always so strong. Then there are the data-related variables. The methodological variables were about how people talk about their research context, whereas these are about how people talk about the kind of data they use. I'm going to go through the journal data first, and then I'll go through the book data. So for journals there is a variety of source data that people might be drawing on: their own research; existing published data; unpublished data, be that unpublished manuscripts or another person's field notes; and introspection, which is still a genre that exists in linguistic analysis, although it may not be so frequently drawn upon in language documentation. And we coded whether the source data was presented, or not applicable; and 'not applicable' was for things like if all the data presented was summarized experimental information, or generalized-over phonetic information; for those kinds of things, thinking about the source of data is not always applicable. 
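Tallies of source-of-data categories like these are what get turned into Pareto-style summaries: categories sorted by frequency, each with its share and a running cumulative percentage. The counts below are invented for illustration, not the survey's actual numbers.

```python
# Build a Pareto-style table from source-of-data tallies: categories
# sorted by frequency, with per-category share and cumulative
# percentage. Counts here are invented for illustration.
tallies = {
    "own": 120,
    "unstated": 50,
    "published": 40,
    "introspection": 10,
    "unpublished": 9,
}

total = sum(tallies.values())
cumulative = 0.0
table = []
for category, count in sorted(tallies.items(), key=lambda kv: -kv[1]):
    share = 100 * count / total
    cumulative += share
    table.append((category, count, round(share, 1), round(cumulative, 1)))
```

Sorting by frequency first and carrying the cumulative line along the tail is exactly what makes the long tail of rare categories visible at a glance.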
What we see here is that there is still an overwhelming focus on using one's own data in linguistic analysis, and those who aren't using their own data are generally using something that's published, or we simply don't know where the data comes from; the general assumption there seems to be that it is probably their own, but it's not necessarily made clear. These Pareto charts are actually very neat; I can say that because I didn't make them myself, this is Andrea's R wizardry, which deserves to be recognized here. What we see is the frequency out of the 271 papers, expressed as a percentage, with a cumulative percentage along the tail there. What we see, though, is that there is quite a bit of source-data variation across the journals. For Studies in Second Language Acquisition (sorry, the slides are not as clear as I hoped) it's overwhelmingly their own data: experimental data that they collected themselves and then wrote up. For LTBA we see a very strong focus on published work, so drawing on existing word lists or texts or analyses to present one's own. And then in Natural Language & Linguistic Theory, well, not overwhelmingly, but between unstated and own data, that seems to take up most of it, with some published work as well. So depending on the journal, we see some variation in where data comes from. In terms of the data genre, we let this filter up and collected whatever categories people discussed in each of the articles. I'm not going to go through all of them; I just think it's a really nice demonstration of the breadth of the kinds of data that linguists are dealing with, and I think some of them capture some of the challenges that we have for data citation and how we go about citing data, because different linguists are working at very different levels of analysis. We have something like a spectrogram, compared to something like a questionnaire, compared to an entire story or a text or songs; these are very different genres, very different types of language data, and we need to think about whether we can reconcile them into a single way of thinking about data, or if they are going to require their own ways to be considered. Across all the journals there is, I mean, a very, very long tail there; I won't go through all the reds and yellows, but there is still a really strong focus on the sentential level, the lexical level, and text, which I think indicates that we still have a focus on what might be considered traditional domains of linguistic analysis across the field of linguistics. In terms of individual journals, I pulled out the Journal of African Languages and Linguistics, Studies in Language and Studies in Second Language Acquisition, and we see some very different focuses there. For the Journal of African Languages and Linguistics, a very strong lexical focus, possibly a lot of interest in untangling noun classes and complex verbal morphology; whereas Studies in Language is much more text-heavy; and in Studies in Second Language Acquisition, tests and experiments take up the first two places. In terms of where the data is now, I've given the kinds of categories we discovered: archived; published, so maybe they are working with an entire published collection of texts, and that in and of itself is where all the data is available; the article contains the primary data, or a summary of it; and whether the location is unstated or online. But what we find is that unstated is overwhelmingly the majority for our journals: for just over 200 of our articles we simply don't know where the data is now. Published is a distant second, and then summary, but we can see archived is this tiny little blip down here. The only journal where unstated isn't the overwhelming first, by the way, is Oceanic Linguistics, where it's neck and neck with published data. So I think there's a point to be made here that a lot of these themes, and where 
linguistics could possibly be doing better, are part of a larger cultural attitude about data: what it means to acquire data, what it means to own data and work with data. I think one of the biggest cultural hurdles we have to overcome as a community is thinking about the role of archives and the role of publicly archived data. A lot of the conversations we've had among ourselves, and that I've had with people when hosting events like Linguistics in the Pub, is that there is a real anxiety about making your data accessible to other linguists, and it's an anxiety about being scooped on your own analysis. I think that in a lot of ways the fears of someone else coming in and scooping you on your analysis of ergativity in a language are in many ways unfounded, given the reality that everyone is actually very busy working on their own data. But I think it's part of a larger cultural problem that we need to think about, in terms of recognizing the importance of archives and recognizing the importance of corpus creation; and if someone does come in and present an analysis of ergative marking in a language you've created a corpus of, we need to formalize ways that recognition is given to the hard work you've done in building that corpus. I'll discuss citation next; I think that's one way to do that. But overall there seems to be a reluctance to archive work, we found. In terms of citing your examples and explaining where they came from, we found, even with very broad categories, that there were about 18 different ways that people would cite their data. I'm not going to walk you through all of those; there's a talk that Andrea and I gave at a DELAMAN workshop (the Digital Endangered Languages and Musics Archives Network) that lists all 18 of those conventions. I am going to talk you through a couple of them, because, as I said before, they illustrate nicely some of the problems we have with the different levels of linguistic data and how we might refer to each of them. If you're citing an example from an existing document, that's quite easy: you can rely on your standard APA citation convention, cite the author, cite the page number; that's quite reliable. If you're citing a Bible quotation, again there's a nice standardized, conventionalized way of doing that; it's a lot easier. But for citing your own work, as we can see, there's a variety of ways of doing it. Citation can be just naming the language: this is a very global level of information, where you simply give the language name. You could refer to the speaker and give information about the speaker as your citation, which is particularly common in the genre of sociolinguistics; sometimes it's just initials, sometimes a first name, sometimes a pseudonym. There are a variety of possibilities here, and sometimes more or less metadata on the speaker is available; here we see the speaker's gender and age, and there may or may not be additional metadata about the person in the paper. The citation can take the form of the title of the story; again, this could include additional information like the speaker's name, but it's focusing on the level of the text rather than the speaker and anything that the speaker may or may not have said. Citation conventions can go down to the level of the example, with a code that's explained by the author. So here we can see that the author has told us that speaker BN, on tape 3, transcription page 12, said this sentence; and provided that this author has labeled the cassettes correctly (there are a lot of provisos here), if we know where the cassettes are, if they've kept their transcriptions and we can resolve the transcription to the cassette, we can eventually find our way back to that original utterance if we wanted to verify it. If we wanted to say, 'I don't reckon we would hear a voiced fricative at the end there, I think we'd hear something else', we could 
actually we've put a little bit of figuring out if we had access to the original information figure that out in a way that might be a lot harder if all we're looking for is a story called Broccoli which may be two hours long so sometimes we also we also have a category for where people had used some kind of code but had not explained what that code was and there are two possibilities here either the person themselves knows and it's kind of more for their record or it's just a way to kind of make it look more official probably a little bit of both there we also had to include these categories if something appears as a reference to an unpublished manuscript if they did not include any form of citation if there was no numbered examples or anything that we could cite and we had other and other was for kind of interesting cases that didn't quite fit any of the other categories for example we had one where interviews were conducted and the name of the person who conducted the interview was given as the metadata point so it didn't really relate to this it had nothing to do with the speaker it was more to do with who had conducted the interview which didn't really relate to the linguistic information so that one in our other category now I could have gone through all 18 categories and explained them in depth and which ones are potentially better in terms of citability and retrievability but actually the main takeaway from this point is that overwhelmingly people do not cite data or our next best option is that they do some kind of standard citation and after that we get the very exciting just to kind of blanket description of the data and that's kind of as much as you get for citation so even though some people are putting a divergent range of conventions into practice actually the majority of people are not citing conventions and not using any citation convention and so we have this aspiration to being a kind of objective endeavor but we're not really living up to it in 
terms of how we do our work. If you ever come to one of my data management talks, the main takeaway, even if you're not really interested in other people accessing your data and you don't want them to ever get their hands on that ergative before you do, if there's one reason I can give to motivate you to cite your work correctly, it's that, as an overwhelmingly lazy person, it makes my life a lot easier if I can go back and re-listen to something without spending half a day tearing through recordings or trying weird find techniques on my transcripts. If I can just go 'oh, I took that from that recording', isn't that nice? Even if it's just for your own sake, citation is useful. Very briefly looking at these same questions for the monographs, and mainly just flagging differences from the journals: the source of data is much more consistent for monographs, unsurprisingly, since they're of a single genre. The dissertations overwhelmingly mention, or are entirely based on, original fieldwork, which is not too surprising. These all add up to more than 50, by the way, because some people would draw on their own fieldwork and supplement that with existing materials. In terms of where the data is now, though, linguists in documentation aren't necessarily doing a whole lot better than anyone else. There's this category I really love; it turned up a lot in language documentation and description. This is where people in their methodology section, or area, or paragraph, depending on how comprehensive they were, would say something like 'at the conclusion of the project it will all be archived with ELAR or PARADISEC or the university'. I think the strike rate was about two out of those five grammars where there is actually some evidence that they have archived their data. It looks really nice to say that you'll archive it, but then you realise it actually takes work and time. There are a range of different options people have had for sharing: they might share it with the community, or put the information online. In terms of, well, you can see the group that I'm clearly angling for there. In terms of the citation type used, because it's a more consistent group, we binned them into five categories by how detailed the citation types were. A lot of people had no data citation, which meant that if I opened up a grammar and pointed to one of the 1058 examples in there, it's quite possible the author would not actually be able to retrieve that from a recording or elicitation session. I mean, they might have a phenomenal memory, but if you can't just call them up and ask them, that makes it a little bit inaccessible. Some people, for most examples, would give some kind of minimal reference to the speaker or the story when referring to examples. Or there would be some kind of reference that would call back to a corpus, and a metadata list of titles may or may not be included in the grammar, so you'd get something like 'this was the speaker' or 'it was this story', and you could actually chase that up if you tried to. We had quite a few people who would give us a nice retrievable tag, but would not necessarily explain whether or where it was archived, or how we could ever make use of the tag. And finally, in this tiny little section up the top here, we had some kind of code that was fully resolvable to the underlying corpus, that had time codes, and was archived. Rather than pulling together other people's examples and shaming them, these are all creative embellishments on my own corpus: for each example in the corpus I've given the speaker's initials; this is the archive label, with PARADISEC, soon to be ELAR as well; that's actually the date, but together it's a file reference; and the time code, if it's a spontaneous or naturalistic example. That kind of thing is somewhat, I mean, everything in documentation and description is time-consuming, but if you invest that time from the beginning it's a lot easier than attempting to do it post hoc. I'm not going to go through all of those citation conventions with examples, but I do just want to flag that this is one example of how, in many ways, language documentation is quite a good model for how resolvability can work if done well. This is from Valérie Guérin's thesis from 2008. She has some nice codes here, and she gives a very clear explanation of them in the methodology at the beginning of her grammar. If we look at that reference, she says that the data is archived with PARADISEC, and if we actually go to the page where it's archived, all I had to do was search that little five-digit number and it pulled up, oh no, it's ELAR, isn't it, sorry, I've been working with both of them and they've kind of merged in my brain. It took me about 30 seconds to get to this page, and if she'd archived with PARADISEC it would probably take about the same amount of time. If you're willing to sign up with the Endangered Languages Archive and their access policy, you could go listen to this right now by following this link. So it's nice and accessible for other people to listen to those examples. If I was a researcher who thought 'oh, that's a very interesting form for the first person in languages of this area', I could confirm that it's definitely that; it wouldn't take me a whole lot of time, and you can get some extra metadata about the recording there, which is really nice. So, it's kind of hard, but I have to face the fact that on the whole people aren't doing that great in terms of being transparent about their methodology, and they're not doing that great in terms of making it clear that their writing resolves to the underlying data. So, we've discussed some of the features that have come out of our discussions about what we think are important basic
features of methodology to consider, and we believe that research should link to underlying data, like that example of the gesture that I gave you that came with this utterance. I employ this kind of practice in my work, so we try very hard to practise what we preach, but if you included Barb's grammar of Sherpa in this analysis, hers probably wouldn't hold up that well either, and she'd probably be very frank about that fact. But I think it's important that it's not up to us to stand here and harangue you and tell you what we think may or may not make a good descriptive grammar methodology, or what might make a good minimum citation standard. It's up to us as a research community to articulate what we value, if we do find this important, if we think it's important that other people be able to go back and listen to the examples that we think are canonical examples of certain phenomena in the languages we work on. Part of that is to encourage good practice. It's not necessarily always about, I mean, we can improve our own work practices, but it's about creating a culture that is much more about transparency and accessibility of data and sharing. Part of that work at the moment is an NSF-funded project on developing standards for data citation. Although it is up to each of us to make changes in our research practice and try to make our data more accessible, I don't think it's necessary that each individual should have to come up with a data citation standard of their own. Andrea Berez-Kroeker, who's one of my co-authors, is one of the principal investigators on this grant, and they are planning to do some of the difficult work for the rest of us and come up with some kind of style sheet for best practice in attribution. They're working with DELAMAN, the archives network, to ensure it's going to work with archiving. So even though it's about improving our work, it's doing it in a way that stays minimally obtrusive and based on the kind of standards that we're already using. We can encourage our students: Andy Pawley has a really sweet article in the Nakayama and Rice LD&C special publication about encouraging graduate students, and it's really the best way, because you're getting in on the ground floor and you're setting up hopefully good habits for the rest of a researcher's life, well, their whole life, or at least their research life. We can formalise our expectations as well, and I think formalising them in some way helps establish that these are important expectations. As one example, at the University of Hawai'i you now cannot be awarded your PhD until you archive: you have to create archiving plans as part of your dissertation proposal, and then you have to submit proof that your materials have been deposited with an archive before you're allowed to submit. So the University of Hawai'i will not have anyone in the future who falls into the 'will archive' category, and a descriptive thesis must cite resolvable examples or resources. What was that? Sorry. So there must be some kind of citation that links it back to the underlying data, so that if I was particularly interested in an example, if I thought it had a really interesting form, I could go back and listen. We can also encourage our colleagues: the peer review process is one point at which I think it's okay to ask probing questions about the methodology people have used and why they've used it. And we can build these expectations in at the funding and planning level for projects, so actually mention your data management process in your proposal, and actually fund access to archives. Archives have for a long time been working on their own funding, but I think building funding in to ensure you have the time and the expertise to archive is another way that we can do that. In order to leave you on a slightly more positive note today, this is a quote from the Melbourne Linguistics in the Pub; they were having a conversation about grammar writing and the state of the art. I really like this because that conversation was happening at almost exactly the same time we were presenting some of this work at the ICLDC documentation conference earlier in the year. So while we're sitting there saying 'look, not a lot of people are citing examples', these early career researchers in Melbourne were saying 'well, we think it's kind of obvious; it makes our work easier; it makes our research much more transparent and relatable and comparable'. To me it's heartening that perhaps we are actually experiencing a culture change. What I feel when I look at the work my colleagues do is that everyone is already doing good work, so this is about giving ourselves the opportunity to think about what we want to do with that work and putting that good practice into words. So, thanks. Still wedging just one fieldwork photo into a data talk. Because they were the only published books that we looked at, because we are documentation researchers first and foremost. Well, I mean, they're doing different things, and they're doing different things better, or not better; it was where we decided to draw the line, because those are the genres that we have the most experience in and the most interest in. But we've made our methodology fully transparent, and we've made our list of publications available; the publications will be listed in the published articles. So we're happy for people to expand and replicate this with other genres and see what they find. I think it's important that we have these conversations at the larger linguistic community level, but also within subdisciplines of linguistics. Computational linguists share a lot of their stuff in places like GitHub, which are repositories where you can download someone else's computational grammar and you can test it out on a corpus of
yours, which is a radically different way of considering reproducibility to, say, making a corpus of spoken language available online. John, that's okay. What you've looked at are people who give examples, for example, but that already is your data, so you have made your data accessible. So I'm going to think about a particular paper while I'm responding, unless you have something related. I'll think about a particular paper while I give this example, I won't mention which, and I'm sure you can think of other papers as well, where you've read something and someone says 'we have this phenomenon in this particular language, here are some examples of it, and here are all the other forms that are in complementary distribution with it', and they may have missed something. This happens in evidentiality all the time: you might decide that a language has a four-way evidential distinction where another researcher has said 'oh, there's only a three-way distinction in this language'. It might be that if you had access to, say, a text in that language, you could check and see: 'well, it's interesting that they've said there's only a three-way distinction, because I can very clearly see this form in their corpus, so I wonder what they're analysing it as'. Or, to use an example from my own research, occasionally I go back and listen to things and realise I'd analysed them really wrong, and the only way I can go back and listen to something four years after I wrote about it is that I know where it is, because of those citations. So it's those kinds of specific examples I think of where it's really beneficial to cite where the example comes from in the larger corpus, but it's also a really good habit for your own ability to verify, let alone worrying about whether other people are ever going to see your data. Medicine? Yeah. So, the NSF grant, just anecdotally, is one of four, and the other three were given to hard sciences; linguistics was the only humanities or social science project that was funded. The NSF project I talked about briefly is a two-year project in which they're trying to create some kind of standard. Other sciences are working on this too, but to varying degrees, and there are varying field-specific problems. For example, we know that in medicine there are massive problems with pharmaceutical companies not being that big on sharing negative results from trials. Maybe we have a bit of confirmation bias too. One thing I noticed when people talk about, for example, the elicitation methods or the stimuli that they use is that no one ever mentions the ones that don't work. No one ever says 'you know what, I used the put videos and people didn't know what was happening and all I got was a bunch of confused speakers, so I didn't use them'. So we have a bit of a positivity bias as well, definitely. Yep. I think there's also a qualitative dimension here, which relates to what Candide's talking about. I know of one particular example where a part of a discourse was cited in a publication, and when some follow-up researchers went back to the actual tape recordings and listened to them, there was no continuous stretch in which that material was adjacent: there were interventions, there were switches to another language, and what the person had done was to extract the stuff they didn't want and present what remained. I think sometimes we do this for good didactic or presentational reasons: if you find a really nice example but there's stuff in it that's too complicated for the point at which you're developing the description, or it's hesitation phenomena, or something else you may not want to include, then people will actually edit what you're calling data, which I think is actually a contestable term. Well, yeah, there's lots of discussion. The relationship between what actually appears in some form in a paper, a grammar, a document or whatever, and what's actually on the recording may not be a direct one at all, but not for bad purposes; it's actually done for didactic or other kinds of reasons. So there's a lot of discussion at the moment about what counts as source data or primary data or raw data in linguistics, and what those terms may actually mean across different subfields. But I think your point is also important in that, for example, we had one grammar, and I can't remember which one, though I could probably find it, in which the person said 'I only worked with one speaker, because I only had access to this one guy, so we just worked together on this grammar'. We were all told that it's really bad to only work with one person, but it's actually really commendable that they've been honest and transparent about that, and I think sometimes people would prefer not to admit this kind of information. We have to get better at accepting it if we just put in a footnote saying 'actually this example had three embedded noun phrases that I've simplified so that I can fit this on the same half page' or something. Yep. Different subfields have their own traditions, so it might be that your individual examples don't resolve very clearly but you presented a lot of rich information at the start; or it might be that you have these very sciencey, authoritative-looking codes on every single sentence, but if I don't know what archive they're in and I can never find that archive, then... So, I wrote my, what do you call it, MPhil conversion? Yes, I'm getting the hang of the terminology. Upgrade? That sounds even better. I wrote my upgrade paper without examples cited particularly clearly, and it was strongly recommended to me by Nick, which will be of no surprise if any of you know him, strongly recommended that maybe I think about ensuring that I had a
good citation methodology going forward, before it grew to more than a hundred examples. I'm quite happy to say I couldn't find some of those examples again, and I only had about three months' worth of fieldwork to work with then; I couldn't imagine coming back to, you know, 20 years' worth of shoeboxes that I'd ignored for a decade. I think that might be an oversimplification of the discussions in language documentation. There's definitely a lot of angst about how we identify people, whether we identify them, and to whom we identify them, which is why we get all these very elaborate access requirements for ELAR, and part of why people are reluctant to open archives is this negotiation around sharing recordings. I don't think it's necessarily just about hoarding all the ergatives for yourself, but yeah, there are cultural attitudes within linguistics, and that also intersects with attempting to be ethical researchers as well, while working within very, very different genres. Well, I have an entire paper pondering the fact that I do have plenty of people who say 'oh yeah, you can share this with everyone', but if you don't have the internet, do you really comprehend what it means to share it with everyone? These are thorny questions. So it's just that you put in a huge amount of work, which is always a work in progress, as you said, when you look back at some of the things you've done. So, related to that, it's a question of acknowledging that your work, it's a lot of work you put in that you may want someone to... Well, since I'm now the boss of all linguistics and universities, um, yeah, let's just start by junking the overly heavy focus on journal publications as the only acknowledgement of research output, and tie good corpus creation to things. One of the discussions that we had at the DELAMAN workshop, mostly because it was NSF people from America, was about formally recognising corpus creation as part of a tenure application: tying good corpus creation and management to things like tenure, future grant accessibility, job promotion, and research output. There's a lot of stuff about altmetrics now, but I don't think corpus creation necessarily falls under altmetrics. What we need is a standardised way of citing. Because we're not in the habit of citing other people's corpora, we're not in the habit of recognising them. When people were citing published papers, it was very, very easy for them to follow APA, because that's what they've been using their whole life as researchers, their entire life as children as well, um, we're getting to the point where I'm tired and overgeneralising ridiculously. People have been using something like a standard citation for books and journals, so they can implement that very easily. We have to get into the habit of thinking of corpora as citable objects within our written work as well, and that can start with things like citing your own corpus. So whenever I write about Yolmo, there'll be a footnote somewhere that says all of these examples come from this corpus, this corpus is accessible here, and if you go to the citation at the end of the paper it will tell you the day I retrieved it. So if in five years' time I have some newer and fancier opinion about Yolmo tone, and someone says 'well, in 2008 you said, based on your own data, that it was this', I can say 'well, that's because that was corpus data from 2008'. Getting used to having these kinds of conversations makes it a lot easier. Part of why I run Linguistics in the Pub is to accept that as researchers we are human, and that this is a practice that we need to constantly be working on. The LSA actually passed a motion to recognise this a number of years ago. And one really nice thing, oh yeah, I should plug the DELAMAN prize, is that DELAMAN just announced a prize for corpora: you can submit your corpus and get a peer review of it. So DELAMAN have a competition aimed specifically at early career researchers. DELAMAN is the Digital Endangered Languages and Musics Archive Network; it incorporates archives like ELAR here, PARADISEC, which is one of the bigger ones in Australia, and Kaipuleohone, which is the University of Hawai'i's, and they all aim to meet certain expectations in terms of data management, data persistence and availability going forward. Archivists say really weird things like 'forever', but I just like 'going forward', for its impermanence. Something like DELAMAN is designed to take the anxiety out of archiving for you: if your archive is part of the DELAMAN network, it should be robust enough to do the job of archiving your linguistic data. Is it actually not so difficult? I mean, if you look at other fields: epigraphers have a history of publishing their epigraphic transcriptions without necessarily the analysis; or taxonomists, the people who do research in biology, there's a whole set of taxonomic collections and classification and categorisation that lies behind that. I think we also have a lot of anxieties about the scope of our work that are often unfounded in light of other disciplines. The synchrotron in Melbourne, which is like a small hadron collider, generates over eight terabytes of data every few hours, which is more than is currently in ELAR for, like, a decade of work. So when we think about the scope of our data, obviously there's a human scale to it that's important to recognise, but a lot of data management people think we're kind of cute, dealing with such cute amounts of data, because we're all struggling to deal with it individually instead of collectively. To also play devil's advocate: if I was a PhD student listening to what you said about Hawai'i, I would be really worried, really scared. You've got four years here to do your PhD; that includes a first year of training and upgrade, maybe a year of fieldwork, and then you're writing up, I see Charlotte nodding her head here, and you're supposed to also produce archival material, which is sensible, but... Well, they're in the States, so they actually have more time than here: you can spend seven years doing a PhD. You can, but you can't afford to, because you're living in Hawai'i, so I think they generally only focus on their dissertation for the last two. The thing I would say to that is that if you start this kind of thing early, it is not actually as onerous as it sounds; it's about starting those good habits before you go on your first field trip. If you don't have a file naming convention before you go on your first field trip, I would be asking why you're going on your first field trip, to be honest. We put outrageous expectations on our students, and then we don't even expect that they're likely to get a job, so what's another expectation, and one that makes their research life easier if it's properly implemented? That is what I would say: it's all about being lazy. Good data management is actually for the lazy. If you don't like effort, then you should do things like have consistent and retrievable file names. Okay, shall we thank Lauren for her presentation?