I need to thank, first of all, the ERC for, well, the whole six-year project of which this is a small piece. And then also, you'll notice we're not at SOAS; we're at Friends House, because we couldn't find a room at SOAS. So I'd also like to thank the Society of Friends for being our gracious hosts today. Otherwise, I'll just get started with my paper.

All right. So I'm going to start a little bit philosophically: when we're doing historical linguistics, what is data? What's our relationship to data? In order to simplify matters, I've taken an example not from the Burmish languages but from Gothic. The Gothic language is more or less only attested in one book, the Codex Argenteus, which since the 17th century has been in the library of Uppsala University. This I am calling an artifactual primary source: someone once used speech, and that speech has somehow been recorded in a physical substance in a way that it exists today and can be consulted.

But not everyone can necessarily go to Uppsala. So, sorry for the size here, but on the left we have the traditional, prestige, locus classicus edition of the Gothic Bible. It's called Die gotische Bibel, and it was edited by Wilhelm Streitberg between 1908 and 1910. This I'm calling an editorial primary source: a representation of the artifactual primary source in a way that's more convenient for consultation, just by virtue of being in libraries, for instance, but also potentially because it's transliterated or has some scholarly apparatus that makes it more convenient to consult than the artifactual primary source.

And then the next level of epistemological distance that I would like to suggest is what I call a secondary source. Here is an example: a dictionary of biblical Gothic by Brian Regan. The nice thing about Gothic is that all the words in this dictionary come from that one book, right? Because we only have one book for this language. My conceptualization here is that a historical linguist may find it more convenient to consult a secondary source, but in principle any claim that's being made can be traced back to the artifactual primary source. And a good historical linguist, if using a piece of information for an argument, would always trace it back at least to the editorial primary source, so that he felt confident it wasn't a typo or something like that.

So, to the case of the Burmish languages. These are the Burmish languages; they're spoken at the border between Burma and China, and there are around eight or nine of them. Only Burmese has a written tradition, so the model I've presented for Gothic works for Burmese: it has a written tradition, mostly stone inscriptions from the 11th to 14th centuries. But Old Burmese has very poor philological resources, which makes it quite inconvenient to work with as a historical linguist. And the other languages are all modern. So this tripartite scheme of artifactual primary source, editorial primary source, and secondary source: how does it map onto languages with no literary tradition?
And what I would suggest there is that a physical record of the speech, like a wax cylinder or a WAV file, would be the artifactual primary source; a transcription of that, for instance in electronic format, would be an editorial primary source; and a scholarly article or a book with an argument, a grammar, or a dictionary would be a secondary source. And one thing that's very lamentable, I think, is that linguists tend to skip the first two: they go straight from their fieldwork to their publication without ever making the artifactual or the editorial primary source available somewhere. There are no artifactual or editorial primary sources for the Burmish languages, and even in Sino-Tibetan I would say there are only two languages that have what I feel is satisfactory coverage in this way. One is Japhug and the other is Yongning Na. Both have been worked on by people at the CNRS in Paris, who have put really hundreds of hours of records online, with both the recordings and the transcriptions.

So in our research we can't use primary sources, because they don't exist, which is frustrating. Instead, these are the kinds of sources we are working with: word lists. We have word lists made by missionaries or diplomatic officers in British colonial times, word lists made by Chinese fieldworkers from the 1950s until now, and then, very recently, word lists from the 1980s until now made by missionaries associated with the Summer Institute of Linguistics. The British ones tend to be organized alphabetically by the English definition, whereas the other two kinds of sources are organized according to some predefined list of meanings. The Chinese ones, for instance, always start with the word for sky, and to those of you who know about the Chinese indigenous lexicographical tradition, that will seem familiar.

Now onto etymological dictionaries; we're writing an etymological dictionary. Here we have a little entry from a German etymological dictionary, with the point just being that it's hard to read. You might find it hard to read because it's small, but it's also hard to read because it's full of acronyms and bibliographic abbreviations, and it's not so well organized as to be machine readable. So it's both hard to read for a human being and impossible to read for a computer. We have tried to model the information in this entry in terms of the relationships between the words. This is Frucht, which comes from fructus in Latin, and the Latin word goes back to an Indo-European root, *bʰruHg-, which also gives, what's the German word? It's the same thing, brauchen, yeah. So the same Indo-European form becomes brauchen through inheritance in German, while Frucht is a borrowing from Latin. We've tried to model this here, and in our own efforts we're trying to have explicit models, so that this data is both human and machine readable.

And then in this next chart: if you're writing software, or something like Wikipedia articles, every line of, let's call it code, that someone changes is version controlled, so at the microscopic level you can look at what the thing looked like at any moment, and then over time you can have a representation like this, where each color is a different user, and you see the lines of code written by those different users. It would be nice if we could represent something like an etymological dictionary in this way, so you knew who contributed which ideas, and when, and could watch the knowledge expand over time.
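To go back to the Frucht entry for a moment, here is a minimal Python sketch of what "explicit and machine readable" could look like: the etymology as a set of typed relations between words, rather than prose with abbreviations. The data structures and relation labels are illustrative assumptions, not the project's actual schema.

```python
# A minimal sketch (not the project's actual schema) of the Frucht entry
# as explicit, machine-readable relations between words.
from typing import NamedTuple

class Word(NamedTuple):
    form: str
    language: str

class Relation(NamedTuple):
    source: Word
    kind: str      # "inheritance" or "borrowing"
    target: Word

pie = Word("*bʰruHg-", "Proto-Indo-European")
fructus = Word("fructus", "Latin")
frucht = Word("Frucht", "German")       # borrowed from Latin
brauchen = Word("brauchen", "German")   # inherited from Proto-Indo-European

relations = [
    Relation(pie, "inheritance", fructus),
    Relation(fructus, "borrowing", frucht),
    Relation(pie, "inheritance", brauchen),
]

for r in relations:
    print(f"{r.source.form} ({r.source.language}) "
          f"-[{r.kind}]-> {r.target.form} ({r.target.language})")
```

Once the entry is in this shape, a computer can answer questions the printed entry cannot, such as listing every borrowing into German, which is exactly the human-and-machine readability the project is after.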
But in fact, if someone writes a new etymological dictionary of a language like Latin, 90% of the information, probably 99%, is old, but there's no explicit model of what's old, what's new, what's original, and where the disagreements are. That's something we would like to make more explicit.

So our aim is to digitize all of the relevant secondary sources. Sorry, we have digitized them all, but now we need to do something with that, and we're starting with this one, Huang 1992, because this one book has about 2,000 of what we call concepts. They're just English words... no, they're Chinese words, translated into eight or so Burmish languages. So it's one source that gives you coverage of the whole family, and then we'll supplement it with other sources.

Yeah, so then, talking about the specifics of our project: we have a source like this, a page of a book. We digitize it, if necessary by copying it out manually, but better to use OCR, and maybe subcontract it to a company that specializes in this sort of thing. Then for each data point we have three pieces of information: the form, the meaning, and the language. I'll give an example. Here we have the concept 'sky': the source gives the Chinese gloss tiān, and the word is pronounced muk in this language, which I would call Maru in English, but which in Chinese is called Langsu. We need to normalize all of that, which is to say, make it comparable with other sources. So we have to say that tiān means SKY, where SKY is not the normal English word sky but some sort of arbitrary reference point to which we will also map words meaning 'sky' in other sources. We normalize the phonetics, which in this case is easy because the source is already written in IPA, so you just normalize it with itself; but the missionary sources, for instance, might use CH for the [tʃ] sound, and we need to normalize that into how it would be written in IPA. And then we need to normalize the language names, so that we know Maru and Langsu are the same thing, and for that we use standardized lists like the ISO codes and the Glottolog codes.

Once we have all of that normalized and in CSV files, we have a piece of software that my collaborator Mattis has made, called EDICTOR, which is a software platform for editing etymological dictionaries. I won't talk now about the guts of it, because that's about how we do our research and the methodology of historical linguistics, and it's not about data management per se. But basically, if I can put it in the best possible light, we click a few buttons and then we get something like this, where systematic correspondences between words in the different languages are just presented to you as a user, and then you can reconstruct the protoform, and then the computer will even allow you to test how predictive your protoforms are. That's the part we're working on right now; it doesn't quite exist yet. So now, jumping to the back of this, I have some screenshots.
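To make the normalization step concrete before the screenshots, here is a rough Python sketch of the kind of mapping involved. The lookup tables are toy examples invented for this write-up, not the project's real concept list, orthography profiles, or language catalogue, and the output field names only loosely follow the LingPy/EDICTOR wordlist conventions (DOCULECT for the language variety, CONCEPT for the reference meaning).

```python
# Illustrative sketch of the three normalization steps described above.
# All lookup tables are toy examples, not the project's real data.

CONCEPTS = {"天": "SKY", "sky": "SKY"}        # source gloss -> reference concept
ORTHOGRAPHY = {"ch": "tʃ", "hk": "kʰ"}        # source spelling -> IPA
LANGUAGES = {"Maru": "mhx", "Langsu": "mhx"}  # variant names -> one ISO 639-3 code

def normalize(gloss: str, form: str, language: str) -> dict:
    """Turn one raw word-list entry into a record comparable across sources."""
    ipa = form
    for source_spelling, ipa_value in ORTHOGRAPHY.items():
        ipa = ipa.replace(source_spelling, ipa_value)
    return {"CONCEPT": CONCEPTS[gloss], "FORM": ipa, "DOCULECT": LANGUAGES[language]}

# The same word, recorded by a Chinese and an English source,
# ends up as one and the same comparable record:
print(normalize("天", "muk", "Langsu"))   # {'CONCEPT': 'SKY', 'FORM': 'muk', 'DOCULECT': 'mhx'}
print(normalize("sky", "muk", "Maru"))    # identical output
```

Rows of this shape, written out as a CSV or tab-separated file, are what the EDICTOR system described next takes as its input.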
So this is the software, EDICTOR. This is the default page if you just open up the system. We have the meaning here above, and then, because it's ordered alphabetically by the English gloss... and these are the different languages, these are different codes for keeping track of the morphemes, and this is the IPA representation. And this is where the computer is keeping track of things: it knows, for instance, that a t is a voiceless dental stop, and the colors are associated with different classes of sounds. So that's the system.

And then this is our GitHub page, which is where we have all the digitized primary sources, and we have a little checklist for each source: have we normalized the meanings, have we normalized the forms, have we normalized the language names. So all of our workflow is happening on GitHub. I would say one problem we have is that we tend to keep reverting to communicating over email rather than through the issue tracker on the GitHub page, and that's, if you like, one point of data management that we're struggling with; but in principle everything we're doing is there online on the GitHub page until we finish.

When we write up an article, for instance, we do it in this system, which is called Overleaf. It's a LaTeX system: here's where you write the LaTeX code, and you get the PDF reproduced in real time. For me, as a kind of LaTeX idiot, this solves a lot of the problems of using LaTeX, where you have to have a compiler, and you get an error message, and it doesn't compile. Here it's compiling all the time, in real time, and you can see what your document looks like. And also, Mattis and I can be logged into the file at the same time, him in Germany, me in London, no problem. So it's a great environment for collaborative research paper writing.

So that's one system, and then this is Zenodo. Once we are ready for publication, there's a certain amount of data or code associated with that publication, which at the time of submission we file with Zenodo, so that it's there forever, for everyone's enjoyment. Zenodo also provides a standard way to cite the data deposit: it associates a DOI with it, which you can then use in your bibliography.

So those are the different systems we have. Now, summing up: we get the data, which is only secondary data, because the primary sources don't exist. We digitize it, we normalize it, we put it in the EDICTOR system, we switch it all about, then we publish some research on the basis of it, which we write in the Overleaf system, and we deposit the data in Zenodo. And the one thing my summary left out is that while we're switching it all about, it is all kept track of, at least in principle, on our GitHub page. So that's it, and if I have extra time... you're keeping me to time, yes? Okay, then I guess we can take some questions. I will also invite you to clap.

I suppose the most obvious question is: in the process of doing this, you are recording for posterity part of your sort of academic, personal workflow. Beyond the intellectual historians of the 22nd century, who will obviously be over the moon to discover in the rubble of London a memory stick containing a copy of this, what purpose does that serve?
As opposed to what you're recording about the primary sources, what purpose does it usefully serve to have a record of the construction of your particular secondary source?

So I think, on the one hand, it serves no ultimate purpose. The submission to Zenodo is the important part, in terms of: this is the data that we used, and this is the code that we applied to it. That's what's important. The stuff on the GitHub page, like the fact that it was on April 22nd that we digitized this source or something, is not of any importance, I think, once the project is over. However, I think it is important that we feel comfortable exposing it to the public. Because if we want to hide something, it means we're embarrassed by it: like, oh, we're very far behind in digitizing our sources. So I think it's a good discipline to say, this is not my private life here; this is my public life, and it's my job. So everything should be out in the open.

Can I press you on that? Sure. I think an alternative view that somebody might take is that, while I'm perfectly happy in a sense to share how I work, and to explain to somebody that for every good idea I put down on a published page there were ten really stupid ones beforehand, I don't necessarily want to share the ten really stupid ideas that led to the one good one.

From my perspective, they won't end up in the publication, and if there is a person out there who is reading our GitHub every morning to look for a stupid idea that they'll then somehow mention in a public forum, that person has some psychological issues, and some time management issues, that aren't really my concern. Yeah. Mandana, you go ahead.

Well, I think it is key and important, because this is basically what allows you to do the archaeology more easily later on, right? It allows you to trace things, and this is actually what makes the research transparent. Because right now the research is not transparent, especially in linguistics, where none of this is ever shared, right? We don't know what the source data is. We cannot test and check the source data. We have no idea what steps have been applied to the data before it's analyzed. But you see, that is already part of the analysis. So I think the division between data management on the one hand and analysis on the other is not correct, because the way you manage your data, the way you divide your data up and chunk it and relate it to each other, is already part of the analysis process. And this, in fifty years' time, will allow someone to check and see: oh, look, on the 21st of February they went in the wrong direction.

Yeah, I think that... I mean, I'd also put it in terms of version control on software or Wikipedia: what's the point of that? Hopefully, most of the time, no one is wasting their time doing this slice-by-slice retrospective look at anything, because it's not a good use of human time. But it is there, and it's also well hidden, right? It's not in the way, but it's there if it becomes important, and I do think that's useful. But I would say that if someone said, look, I'm making you choose between depositing your data in a public archive and having a version control system for your workflow, I would definitely say, well, that's a false dichotomy; but if you're going to make me choose, go with the archiving of your data in a public repository.
Just to come back to your point: many of us have tripped over mistakes made by people in the past, and without a forensic trail like we have here, we don't always understand how those mistakes came about. And if we don't understand how they came about, we're quite happy to commit them again. So it's very useful to have a forensic trail, as you say, to see where someone may have gone wrong in the past. I think that's really helpful.

On the reasons for using GitHub, two things. The first is that it's a collaborative platform. If you're writing code with a number of colleagues, you can use GitHub as a way of merging your code together and ensuring that there is a master version of that code available that everyone agrees on. That's very important, and GitHub does it extremely well. That's the first thing. The second thing is that if people are reusing your code, they may not want the final version; they may think, oh, they wrote some code here six months ago that's very useful, and they can take the repository back to that stage and then build on top of the code. We use GitHub for that too. And Nathan's research is being funded publicly, I assume, and therefore people should be able to do that.

Yeah, I agree with all of that. I think those considerations are particularly important for software writing, which is what GitHub was designed for. So one question, just practically speaking, is: would it not be better if there were some kind of version-controlled etymological-dictionary-writing platform? Well, that doesn't exist, and we more or less find GitHub easy enough to use that it works for our purposes. But actually, Mattis just made me aware in the last couple of days of something called the Open Science Framework, and they seem to have designed a similar kind of version-controlled workflow system, particularly with biologists and psychologists in mind, which might be more user-friendly than GitHub, where you potentially have to use the command line to push and pull things. So I do think that in the near future, hopefully, though it will take all of us to be on board with this agenda, having version control systems that are more easily usable by disciplines other than computer science, like epigraphy or religious studies, would be a good development. I've really only just barely looked at it, but I bet this Open Science Framework, even if it's specifically geared to biologists and psychologists, might be a step towards the sort of work we're doing, though maybe more work needs to be done in that direction as well.

I think let's leave it there, because then we'll finish a tiny, tiny bit early. So please...