I need to thank, first of all, the ERC for the whole six-year project of which this is a small piece. And then also, you'll notice we're not at SOAS; we're at Friends House, because we couldn't find a room at SOAS, so I'd also like to thank the Society of Friends for being our gracious hosts today. Otherwise, I'll just get started with my paper.

All right. So I'm going to start a little bit philosophical: when we're doing historical linguistics, what is data? What's our relationship to data? In order to simplify matters, I've taken an example not from the Burmish languages but from Gothic. The Gothic language is more or less only attested in one book, the Codex Argenteus, which since the 17th century has been in the library of Uppsala University. This I am calling an artifactual primary source: someone once used speech, and that speech has somehow been recorded in a physical substance in a way that it still exists today and can be consulted. But not everyone can necessarily go to Uppsala. Sorry for the size here, but on the left we have the traditional, prestige, locus classicus edition of the Gothic Bible. It's called Die gotische Bibel, and it was edited by Wilhelm Streitberg between 1908 and 1910. This I'm calling an editorial primary source: a representation of the artifactual primary source in a way that's more convenient for consultation, just by virtue of being in libraries, for instance, but also potentially because it's transliterated or has some scholarly apparatus that makes it more convenient to consult than the artifactual primary source. And then the next level of epistemological distance that I would like to suggest is what I call a secondary source. Here is an example: this is a book, a dictionary of Biblical Gothic by Brian Regan. The nice thing about Gothic is that all the words in this dictionary come from that one book, right? Because we only have one book for this language.
And my conceptualization here is that a historical linguist may in his work find it more convenient to consult a secondary source, but in principle any claim that's being made can be traced back to the artifactual primary source. And a good historical linguist, if using a piece of information for an argument, would always trace it back at least to the editorial primary source, so that he felt confident it wasn't a typo or something like that.

So, in the case of the Burmish languages: these are the Burmish languages; they're spoken at the border between Burma and China, and there are around eight or nine of them. Only Burmese has a written tradition, so the model I've presented for Gothic works reasonably well for Burmese: the written tradition is mostly stone inscriptions from the 11th to 14th centuries, though Old Burmese has very poor philological resources, which makes it quite inconvenient for a historical linguist to work with. But the other languages are all only modern, so how does this tripartite scheme of artifactual primary source, editorial primary source, and secondary source map onto languages with no literary tradition? What I would suggest is that a physical record of the speech, like a wax cylinder or a WAV file, would be the artifactual primary source; a transcription of that, for instance in electronic format, would be an editorial primary source; and a scholarly article, or a book with an argument, a grammar or a dictionary, would be a secondary source. And one thing that's very lamentable, I think, is that linguists tend to skip the first two: they go straight from their fieldwork to their publication without ever making the artifactual or the editorial primary source available somewhere.
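The tripartite scheme just described can be sketched as a small data model. This is purely illustrative: the class and field names are my own invention, not any actual schema from the project.

```python
from dataclasses import dataclass

# An illustrative sketch of the three levels of sources, each level pointing
# back to what it derives from, so that any claim can in principle be traced
# to the artifactual primary source.

@dataclass
class ArtifactualPrimarySource:
    """A physical record of speech: a manuscript, a wax cylinder, a WAV file."""
    description: str
    location: str

@dataclass
class EditorialPrimarySource:
    """A transcription or edition representing an artifactual source."""
    description: str
    based_on: ArtifactualPrimarySource

@dataclass
class SecondarySource:
    """A grammar, dictionary, or article built on primary sources."""
    description: str
    based_on: EditorialPrimarySource

codex = ArtifactualPrimarySource("Codex Argenteus", "Uppsala University Library")
streitberg = EditorialPrimarySource("Die gotische Bibel (Streitberg 1908-1910)", codex)
regan = SecondarySource("Regan's dictionary of Biblical Gothic", streitberg)

# Tracing a claim in the secondary source back to the artifact:
assert regan.based_on.based_on is codex
```

The point of the chained `based_on` field is exactly the discipline described above: a secondary source that cannot name its editorial source, or an edition that cannot name its artifact, breaks the chain.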
There are no artifactual or editorial primary sources for the Burmish languages, and even in Sino-Tibetan I would say there are only two languages that I feel have satisfactory coverage in this way: one is Japhug and the other is Yongning Na. Both have been worked on by people at the CNRS in Paris, who have put really hundreds of hours of records online, with both the recording and the transcription. So in our research we can't use primary sources, because they don't exist, which is frustrating. Instead, the kinds of sources we are working with are word lists: word lists made by missionaries or diplomatic officers in British colonial times, word lists made by Chinese fieldworkers from the 1950s until now, and then, from the 1980s until now, word lists by missionaries associated with the Summer Institute of Linguistics. The British ones tend to be organized alphabetically by the English definition, whereas the other two kinds of sources are organized according to some predefined list of meanings. The Chinese ones, for instance, always start with the word for 'sky', and to those of you who know about the Chinese indigenous lexicographical tradition, that will seem familiar.

Okay, so now on to etymological dictionaries; we're writing an etymological dictionary. Here we have a little entry from a German etymological dictionary, with the point just being that it's hard to read. You might find it hard to read because it's small, but it's also hard to read because it's full of abbreviations and bibliographic references, and yet it's not so well organized that it's machine-readable. So it's both hard to read for a human being and impossible to read for a computer. We have tried to model the information in this entry in terms of the relationships between the words: this is Frucht, which comes from fructus in Latin, and the Latin word goes back to Indo-European *bʰruHg-, which is, what's the German word, the same root as in brauchen.
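The kind of explicit, machine-readable model described here can be sketched as follows. The relation names and the data layout are invented for illustration; they are not the project's actual format.

```python
# A minimal sketch of the information in the Frucht entry as explicit,
# machine-readable relations, rather than prose full of abbreviations.

relations = [
    # (word, language, related word, related language, relation type)
    ("Frucht",   "German", "fructus",  "Latin",               "borrowed_from"),
    ("fructus",  "Latin",  "*bʰruHg-", "Proto-Indo-European", "inherited_from"),
    ("brauchen", "German", "*bʰruHg-", "Proto-Indo-European", "inherited_from"),
]

def etymology(word: str, lang: str) -> list:
    """Follow the relations back as far as recorded, returning each step."""
    chain = []
    current = (word, lang)
    changed = True
    while changed:
        changed = False
        for w, l, rel_word, rel_lang, rel_type in relations:
            if (w, l) == current:
                chain.append((rel_type, rel_word, rel_lang))
                current = (rel_word, rel_lang)
                changed = True
                break
    return chain

print(etymology("Frucht", "German"))
# → [('borrowed_from', 'fructus', 'Latin'),
#    ('inherited_from', '*bʰruHg-', 'Proto-Indo-European')]
```

Because each step carries its relation type, the same record distinguishes the borrowing (Frucht) from the inheritance (brauchen), which the prose entry leaves implicit.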
So the same Indo-European form becomes brauchen through inheritance in German. We've tried to model this here, and in our own efforts we're trying to have explicit models, so that this data is both human- and machine-readable.

And then, in this next chart: if you're writing software, or something like Wikipedia articles, every line of, let's call it code, that someone changes is version controlled, so at the microscopic level you can look at what the thing looked like at any moment; and over time you can have some sort of representation like this, where each color is a different user, and you see the lines of code written by those different users. It would be nice if we could represent something like an etymological dictionary in this way, so you knew who contributed which ideas when, and could watch the knowledge expand over time. In fact, if someone writes a new etymological dictionary of a language like Latin, 90% of the information, probably 99%, is old, but there's no explicit model of what's old, what's new, what's original, where the disagreements are. That's something we would like to make more explicit.

So our aim is to digitize all of the relevant secondary sources (sorry, we have digitized them all, but now we need to do something with that), and we're starting with this one, Huang 1992, because this one book has about 2,000 of what we call concepts: Chinese words, translated into eight or so languages, so it's one source that gives you coverage of the whole family. Then we'll supplement that with other sources. So, talking about the specifics of our project: we have a source like this, a page of a book, and we have it digitized, if necessary by copying it out manually, but better by using OCR, maybe subcontracted to a company that specializes in this sort of thing. And then for each data point we have three pieces
of information: the form, the meaning, and the language. To give an example: the source has the Chinese word tiān 'sky', and it's pronounced muk in this language, the language I would call Maru in English, but which in Chinese is called Langsu. So we need to normalize that, which is to say make it comparable with other sources. We have to say that tiān means SKY, where SKY is not the normal English word 'sky' but some sort of arbitrary reference point to which we will map other words meaning 'sky' in other sources. We normalize the phonetics; in this case it's already written in IPA, so it's quite easy, you just normalize it with itself, but the missionary sources, for instance, might use 'ch' for the tʃ sound, where we need to normalize that into how it would be written in IPA. And then we need to normalize the language names, so that we know Maru and Langsu are the same thing; for that we use standardized lists like the ISO codes and the Glottolog codes.

Once we have that all normalized in CSV files, we have a piece of software that my collaborator has made, called the Edictor, which is a software platform for editing etymological dictionaries. I won't talk now about the guts of it, because that's about how we do our research and the methodology of historical linguistics, and it's not about data management per se. But basically, if I can put it in the best possible light, we click a few buttons and then we get something like this, where systematic correspondences between words in the different languages are just presented to you as a user, and then you can reconstruct the proto-form, and the computer will even allow you to test how predictive your proto-forms are. That's the part we're working on right now; it doesn't quite exist yet.

So now, jumping to the back of this, I have some screenshots. This is the software, the Edictor; this is the default page if you just open up the system. We have the meaning here above,
because it's alphabetical by the English definition; these are the different languages; these are different codes for keeping track of the morphemes; this is the IPA representation; and then this is where the computer keeps track of the sounds: it knows, for instance, that a t is a voiceless dental stop, and the colors are associated with different classes of sounds. So that's the system.

And then this is our GitHub page, which is where we have all the digitized sources, and we have a little checklist for each source: have we normalized the meanings, have we normalized the forms, have we normalized the language names? So all of our workflow is happening on GitHub. I would say one problem we have is that we tend to keep reverting to communicating over email rather than through the issue tracker on the GitHub page, so that's, if you like, one point of data management that we're struggling with; but in principle everything we're doing is online on the GitHub page until we finish.

When we write up an article, for instance, we do it in this system called Overleaf. It's a LaTeX system: here's where you write the LaTeX code, and then you get the PDF rendered in real time. For me, as a kind of LaTeX idiot, this solves a lot of the problems of using LaTeX, where you have to have a compiler and you get an error message and it doesn't compile; here it's compiling all the time, in real time, and you can see what your document looks like. And Mattis and I can be logged into the file at the same time, him in Germany, me in London, no problem, so it's a great environment for collaborative research-paper writing. So that's one of the systems.

And then this is Zenodo. Once we are ready for publication, there's a certain amount of data and a certain amount of code associated with that publication, which at the time of submission we deposit with Zenodo, so that it's there forever for everyone's enjoyment. And then there's one more thing Zenodo provides; it's not
here, actually, it's a little bit further down: it provides a standard way that you can cite the data deposit, associating a DOI with it that you can then use in your bibliography. So those are the different systems we have.

Just summing up now: we get the data, which is only secondary data, because the primary sources don't exist; we digitize it; we normalize it; we put it in the Edictor system; we switch it all about; then we publish some research on the basis of it, which we write in the Overleaf system; and we deposit the data in Zenodo. The one thing I left out of my summary is that while we're switching it all about, that is all kept track of, at least in principle, on our GitHub page. So that's it. And if I have extra time, you're happy with the time? Yes? Okay, then I guess we can take some questions. I will also invite you to clap.

[Question] I suppose the most obvious question is: in the process of doing this, you are recording for posterity part of your academic personal workflow. Beyond the intellectual historians of the 22nd century, who will obviously be over the moon to discover in the rubble of London a memory stick containing a copy of this, what purpose does that serve, as opposed to what you're recording about the primary sources? What purpose does it usefully serve to have a record of the construction of your particular secondary source?

[Answer] So I think, on the one hand, it serves no ultimate purpose. The submission to Zenodo is the important part, in terms of: this is the data that we used, and this is the code that we applied to it. That's what's important. The stuff on the GitHub page, like the fact that it was on April 22nd that we digitized this source, is not of any importance once the project is over, I think. However, I do think it is important that we feel comfortable exposing that to the public, because if we want to hide something, it means we're embarrassed, like, oh, we're very far behind in digitizing our sources. So I think it's a good discipline to say: this is not my private life
here; this is my public life, and it's my job, so everything should be out in the open.

[Question] Can I press you on that?

[Answer] Sure.

[Question] I think an alternative view somebody might take is: while I'm perfectly happy, in a sense, to share how I work, and to explain to somebody that for every good idea I put down on a published page there were ten really stupid ones beforehand, I don't necessarily want to share the ten really stupid ideas that led to the one good one.

[Answer] From my perspective, they won't end up in the publication, and if there is a person out there who is reading our GitHub every morning to look for a stupid idea that they'll then somehow mention in a public forum, that person has some psychological issues and some time-management issues that aren't really my concern. Yeah, Mandana, you were ahead.

[Comment] Well, I think it is important, because this is basically what allows you to do archaeology later on; it allows you to trace things, and this is actually what makes the research transparent. Because right now the research is not transparent, especially in linguistics, where none of this is ever shared: we don't know what the source data is, we cannot test and check the source data, and we have no idea what steps have been applied that lead to the analysis. But you see, that is part of the analysis. I think the division between data management versus analysis is not correct, because the way you manage your data, the way you divide your data up and chunk it and relate it to each other, is already part of the analysis process. And in fifty years' time this will allow you to have a check on it; it will say, oh, see, on the 21st of February...

[Answer] Yeah, I think, I mean, I would also put it in terms of version control on software or Wikipedia: what's the point of that? Hopefully, most of the time, no one is wasting their time doing this slice-by-slice retrospective look at anything, because it's not a good use of human time. But it is there, and it's
also well hidden, right? It's not in the way, but it's there if it becomes important, and I do think that's useful. But I would say that if someone said, look, you have to make me choose between depositing my data in a public archive or having a version-control system for my workflow, I would definitely say, well, that's a false dichotomy; but if you're going to make me choose, go with the archiving of your data in a public repository.

[Comment] Just to come back to your point: many of us have tripped over mistakes made by people in the past, and without a forensic trail like we have here, we don't always understand how the mistakes of the past came about. And if we don't understand how they came about, we're quite likely to commit them again. So it's very useful to have a forensic trail, as you say, to see where someone went wrong in the past; I think that's really important. Regarding GitHub, two things. First, it's a collaborative platform, so if you're going to include another colleague, having Git and GitHub as a way of merging your work together, and ensuring that there is a master version of the code available that everyone agrees on, is very important, and GitHub does that extremely well. The second thing is that if people are reusing your code, they may not want the final version; they may think, oh, they've written some code here that's very useful, code that was written six months ago, and they can take the repository back to that stage and then build on top of the code as it was then. And if Nathan's research is being publicly funded, people should be able to do that.

[Answer] Yeah, I agree with all of that; those considerations are particularly important for software writing, which is what GitHub was designed for. So one question is, just practically speaking: would it not be better if there were some kind of version-controlled etymological-dictionary-writing platform? Well, that doesn't exist, and we find GitHub more or less easy enough to use that it works for our purposes.
But actually, Mattis just made me aware in the last couple of days of something called the Open Science Framework, and they seem to have designed a similar kind of version-controlled workflow system, particularly with biologists and psychologists in mind, which might be more user-friendly than GitHub in terms of potentially using the command line to push and pull things. So I do think that in the near future (hopefully, though it will take all of us being on board with this agenda) having version-control systems that are more easily usable by disciplines other than computer science, like epigraphy or religious studies, would be a good development. I've really only just barely looked at it, but I bet this Open Science Framework is somehow specifically geared to biologists and psychologists, so it might be a step towards the sort of work we're doing, though maybe more work needs to be done in that direction as well.

[Chair] I think, yes, let's, because then we'll be a tiny bit early. So, um, please.
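Finally, to make the normalization step described in the talk concrete, here is a minimal sketch. The mapping tables and the Glottolog-style language code are invented placeholders, not the project's actual data.

```python
# Normalizing the three pieces of information in each data point: the form,
# the meaning, and the language. All tables here are tiny invented examples.

# 1. Meaning: map source glosses (Chinese or English) onto an arbitrary
#    reference concept such as SKY.
CONCEPTS = {"天": "SKY", "sky": "SKY"}

# 2. Form: rewrite source-specific orthography into IPA, e.g. a missionary
#    source writing "ch" for the sound IPA writes as "tʃ".
ORTHOGRAPHY = {"ch": "tʃ"}

# 3. Language: map variant language names (English "Maru", Chinese "Langsu")
#    onto one standardized code (a made-up, Glottolog-style placeholder).
LANGUAGES = {"Maru": "maru0000", "Langsu": "maru0000"}

def normalize(form: str, gloss: str, language: str) -> tuple:
    """Return the (form, concept, language code) triple for one data point."""
    for source_spelling, ipa in ORTHOGRAPHY.items():
        form = form.replace(source_spelling, ipa)
    return form, CONCEPTS[gloss], LANGUAGES[language]

# The same word recorded by two differently organized sources converges:
assert normalize("muk", "天", "Langsu") == normalize("muk", "sky", "Maru")
```

Triples normalized this way can be written out as the CSV files the talk mentions, ready for comparison across sources in a tool like the Edictor.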