I'm a linguist, and through different kinds of processes I became the head of a digital archive. I'm not an archivist or a librarian; it came to me through a series of circumstances, and it's kind of a tricky thing to run. So let me tell you what this is. The Endangered Languages Archive was created about 15 years ago at SOAS with funding from Arcadia, Rausing, private money. The funding ran out in 2014, and now it has been taken over by the SOAS Library. This is an archive for language documentation records, and it responds to the situation worldwide: because of globalization, climate change and urbanization, the 7,000 languages that are spoken today are in danger, and linguists predict that by the end of this very century half of them will have fallen silent. So the money that has come in goes to linguists all over the world, to go into these places and record the languages as they're still spoken today, and to bring this material back into the digital archive and make it publicly accessible. We give grants of about 1 to 1.45 million pounds every year to people all over the world. We have no restriction on host institution or on grantee nationality. So we give grants to the Ivory Coast, to Papua New Guinea, to wherever the successful application comes from. What this means is that a lot of people who have no training in digital data management collect a lot of data that we then need to get into the digital archive.
Make sure you're in the microphone.
Hello. I think you can hear me. I usually speak quite loud. Anyway, the background framework that we support is the open research framework. The data should be shared. It should be cited. The research should be reproducible and transparent, and the material should be publicly accessible. It is important for us to make that distinction, because these are people who record people talking about their lives, about political circumstances, about their histories. It's personal.
It's private data. And we are in a legally very difficult situation because of privacy rights that we cannot abide by. So when we talk about publicly accessible, what we mean is accessible to registered users of our archive. Anyone can register, but it's not open access in the strict sense of open data. So how can you imagine this? The way to think about it is this: this is Sophie Salffner, who is a wonderful colleague of ours. She gets a grant from us, or the AHRC or Leverhulme. She goes to the field with audio and video recorders, and she records people talking. She records stories, rituals. Then she transcribes and translates what they say. She analyzes the sounds of the language. She writes a dissertation, gets a PhD. She publishes a paper about it, and then she becomes a professor in Nigeria or a poet in Germany, a boat builder in Spain, a human rights activist in China. So what happens to the materials? This is the process that we are trying to handle: they're collecting a lot of material, primary and secondary material, and this needs to go into our digital archive so that we can make it available to linguists and researchers worldwide, but also to the community members who have engaged in these projects. So what constitutes data in our project? Video recordings, audio recordings, photographs, annotations (that is, transcriptions, translations, morphological analysis), lexical databases, text files, and metadata files. And I've added the formats behind it, namely the formats that we accept, which doesn't mean that these are the formats that we get from our depositors. The key part is that the linguists who conduct this fieldwork are not trained in the creation of digital collections. They're not trained in data management, in recording and record keeping. Right? So they have a mess on their computers, with file names like, you know, "Econ 01 story told by, you know, and Gemini, number one, number two, number five, final, final". And so this is the material that we deal with.
Our data is stored, hopefully, on hard drives and in Dropboxes, on local servers and in offsite data centers, and it gets transferred to us. We only accept archive-friendly formats, MP4, WAV, JPEG, TIFF and XML, which means that we also have to train our people in creating these kinds of formats. I've just listed the basic software that our grantees are using: for example CMDI Maker, which is a metadata editor tool developed at the University of Cologne, and Arbil and LAMUS, which is the software system we're using, developed at the Max Planck Institute for Psycholinguistics in Nijmegen. For conversion: HandBrake, Audacity, Avidemux, tsMuxeR. We only train on and use non-proprietary software. And then for the analysis: ELAN, Toolbox, FLEx, Praat, and Pima, for example. So if you think about it, these are linguists who are trained in linguistic analysis, and this is all the stuff that they need to get used to working with, right: to convert a video file into these formats, to fuse the files together, to then turn them into MP4 so they can feed them into our annotation tool, ELAN, and do their analysis. And then, of course, we need the Windows version, the Linux version, and the Mac version. These tools, FLEx and Toolbox for example, are created by SIL, the Summer Institute of Linguistics, and they decided 10 years ago that they're not doing anything on Macs. So there are only Windows versions out there for these things. So this is the material. We're trying to train the linguists to create a collection out of the files that they've been recording and to put together all the materials in a way that we can upload it, or they can self-upload it, to our digital archive. Now, if they don't upload it, we upload it, which means that we have a digital archivist and three to four archive assistants. Then we have a MySQL database that we use to keep track of who's working on what and what has been submitted.
And then we have a folder structure with a workflow of instructions, namely, first you ingest it here, then you curate it, and so on. So how's the data shared? Don't you love these kinds of files? Yes, so someone sends us material, someone sends us a stick, a hard drive, and we put it all into ingestion. Then we move it into curation, and then we check the file names, and then we have to change all the file names because they have spaces, special characters, and whatnot. Once we have those, we have to check that every record has a metadata file that will tell you where it was recorded, when it was recorded, what the content is, what the tags are, because we have a faceted browser system, so it has to have keywords and tags to be discoverable. Once that's done, it goes into curation two, and when it's done there, it gets uploaded into our system online, and it's discovered through a discovery layer that is plugged onto our data management system. All of that is happening on SOAS servers. We have a backup system at ULCC, which is really not the ideal thing because it's in a basement, and the backup server is sitting next to the main server, so if we have a flood, we're pretty much screwed. We were able to convince the CLARIN forum to help us with off-site backup and to mirror everything into the data center of the Max-Planck-Gesellschaft in Garching. This is CLARIN-driven. It's an EU project. Given the Brexit situation, we have no idea how this is going to happen, because right now we're mirroring everything once a week into that big data center. We have no idea how this is going to be continued and maintained. So what we have is basically LAT, the Language Archiving Technology, which is what Omeka is for you, for example. It was developed by the Max Planck Institute for Psycholinguistics when they had a similar project funded by the Volkswagen Foundation. This has become, over the past years, our legacy system.
When you look at it, when it displays the data, it looks like this. So basically, if users go in there, and it can be any kind of user, they have no clue what this is. They can't find things, because it's a Windows-based clickable system where you then have displayed what the metadata for this file is. And since we have a flat folder structure, we have no hierarchies. You click on one, and then it goes, like, 500 files down. Now, because the interface was so difficult, we thought we needed to do something else about it. So we had VuFind, which is the discovery layer of the SOAS library system, grab into our database and then display the materials that are in there in this way. So the basis underneath is the same, but it's discovered in this way. What you can see here, or not, is that on the left side you have the languages in the recordings listed. You have the type of file, audio, video, whatever it is. And then you have the genre and the participants. So you can discover, for example, the recordings by a particular person, and then listen to them or watch them. This is the back-end entry point: if you deposit materials, we'll open a workspace for you. You get your node for your collection, and then you can upload your materials all by yourself, which is, of course, always a problem because the server is down, LAMUS crashes, Java doesn't work, the browser has a new update, and it doesn't work anymore. So we're constantly like, how do you call the people that put fires out? Yes. Yes, that's what we're doing at all times. Now, the legacy system that we're working with is run by the Max Planck Institute, and, nicely, we have a student in Cologne whom we then call and say, can you please fix this? Because we don't have anyone at SOAS to manage the system. But if we have problems with VuFind, then that is managed by SOAS.
So here you have, for example, a landing page of our system, where you can see the project, the documentation of Modern South Arabian languages, Bathari, by Miranda Morris, location Oman, and there's a summary and how to acknowledge the materials. This is where it was collected, and here are the kinds of materials that you have. So this makes the data available to the public and to other researchers, so you can find it through this interface. Now, if you think about it, the logic that we are applying is that we're documenting these languages so that researchers worldwide can share their data and work with each other. The way this is constructed is as an academic construct, language as an academic construct. If, for example, someone in Mexico whose material was collected in Chatino wants to find the recordings of their own grandfather, they can't, because this is all in English, it's left-to-right based, and it's heavily text-based. So the way our interface is structured is actually stopping people from working with their own materials. That's another challenge that we're facing, because we hope to make it available not only to the academics, but also to the public and to the communities themselves. So in principle, what we have is the linguist who records the primary material, creates the metadata, and determines access: determines whether, when you go on a page, you can listen to and see the materials, or whether there are access restrictions because there's any kind of sensitivity. It could be a women's language that only women are allowed to listen to. So if you want to see that material, you have to request access from the depositor. They have to convert the materials, they have to organize them and back them up, and then they send it over to us. What we do is we train them, we curate things, we fix things constantly, we upload, we back things up and we maintain the system. So this is basically what we do.
And to make it possible that a user can discover, access, understand, use and cite the materials. So this is basically what we're trying to do, and we stand right in between those two sides. And now comes one of the nicest points in Nathan's question, which is, how will your data be archived for posterity? Well, we have now become part of the National Research Library at SOAS. Before that, we were our own small unit based on private funding. So now we have been integrated into the research library at SOAS, where we hope to have a guarantee that this material will be kept available. However, it's the institution's responsibility, and given the financial crunch on institutions, the problem is of course: if this is not used in teaching, why should we keep investing money to maintain the systems that are running the whole thing? That's the scary part. And this is the end of what I have to say. And we have about 80 ongoing grants during the year, which means we get 80 deposits annually. The majority, we hope, are uploaded by the depositors themselves, and the rest is managed by us. Yes?
Just out of curiosity, it may not be quite intended. Why did you treat JPEG as a suitable format for...
Uh, I don't know. Oh, okay. We might convert it into this. Same way we get some word processing documents and then turn them into PDF/As.
Right. VuFind, which is...
Yeah.
So VuFind is the discovery layer that's used for the whole SOAS library. So if I went to the main SOAS library page and typed in something like South Arabian languages, would that bring me to a record for the submission you were showing?
Thank you. No. Guess what? So this is another thing. This is about the politics behind the institutional funding, right? What it will do is discover everything in all the digital collections except ELAR, because we were on version three and then they updated everything else to version four, but not us.
Yes, yes. Okay. That's it. That's a longer discussion.
Okay, yeah, okay. Alcohol. Yeah, maybe. Yes.
Um, this may sound like a very strange question, and maybe there's no way you can deal with it, but you mentioned that when there are restrictions on access, one needs permission from the depositor. Well, I have no connection with a project like this, but at least it used to be the case that Japanese libraries, Japanese archives, hold manuscripts, and if you wanted to see them, you needed to ask the person who had donated the manuscript. Usually they were dead, right? And the response then was, well, you're out of luck.
Yeah. And... You tell me. This is just crazy, right? Yes, so the situation with this is that it's a young archive, and the majority of the material that has been collected is new material, right? This is not legacy material. So the depositor who collected the materials is alive, and by the licensing agreement they have to give us the name of a delegate. And if we can't reach the depositor or the delegate, we have a sunset clause: the materials go over into our management, and then we can determine whether they can be seen. The issue isn't over yet. Yes, I am. You... Well, the thing is we don't... So what we're not facing is this whole copyright issue, right? Because it's not written material. These are all unwritten languages, right? We don't have scripts or texts or anything for them. These are all spoken languages.
But you did raise that you are facing a privacy issue.
Yes.
If an individual subsequently approached you and said, the recording that was taken by this anthropologist wasn't really appropriate, right? There are privacy laws in my country, I wish to avail myself of them. You would be obligated to...
We'll take them down. Absolutely.
Now, here's a question. How much of a legacy issue is that for you? Would you feel obligated to remove the data if a delegate subsequently came to you and said, I don't want my grandfather's material to still be on this database?
Yes.
Right.
So the data has a risk associated with it?
Yes.
That it could be removed for centuries to come?
Absolutely. Right. Because we're working with communities all over the world that have different cultural restrictions. So for example, in Australia, in many of the Indigenous communities, if a person dies, you cannot show their image anymore. You cannot show their face, right? So if there's any photograph of this person or any video recording of this person, this material will be restricted in access for the time being. So what happened is that when this was all created 15 years ago, we had no idea what we were doing, right? We were just learning along the way what the digital world actually means. And we honestly don't even know what the whole legislation actually means, because this might have been collected in Papua New Guinea by a French researcher and it's sitting in the UK. So, I mean, it's all a gray zone, and we're trying to fly under the radar here because we don't really know what it all means and what comes with it. However, when you deposit with us, you have to sign a licensing agreement, and that licensing agreement takes us out of any responsibility if anything happens, because the researcher who collects the data, right, has a legal system within their own host institution that they have to abide by. Now the interesting thing is, because we're funding worldwide: the Anglo-Saxon systems are very restrictive, right? They have to go through IRB control in the US. You have ethics control in the UK, you have it in Australia. The French have nothing. The Germans don't have much. So if you talk to a German person, they're like, what, what's the problem? Now what is interesting about this is that the researchers are not used to thinking about access and about the fact that they are creating a digital collection. They're going in there to write a dissertation about a topic, to write a grammar.
They don't think about the fact that they're putting this on the web and the entire world can see it, and once they're putting it in, they suddenly realize, oh my God, anyone can see my messy transcription. Oh my God, and then comes the paranoia that, oh my God, if I put this all up there, then someone could scoop me. Which is, of course, a bit bizarre, because you're probably the only person who's ever been to that village, and the only person who can manage to go through this data sometimes. But everything comes at once, right? Digital access, shareable, open access, open data. And so we're trying to manage this in a really, really careful way, with the take-down policy and with having a delegate to manage this. But if you have, for example, North American materials, so materials from North American and Canadian tribes, that's always completely restricted. They don't use the materials themselves, they don't share the materials. And the same holds for Aboriginal materials. Africa, Brazil: really problematic. And you can now see that over the past two, three years the situation is changing, for example in Australia, because the elders who survived genocide, whose children were taken away and brought up by white people, are now saying, actually, no, put it all up, because my grandchildren don't speak the language anymore. And actually, their grandchildren are putting everything on Facebook anyway. Right? So you can really see how things are changing towards accepting a digital archive to make this material available.
I mean, I'll mention the reason that was in the back of my head. I thought of how the British Museum collected various artefacts that were bodily remains in the 19th century and the late 18th century, under what it presumed to be good practice at the time, with the intention that these were all parts of an ongoing collection.
And of course, about 10, 15 years ago, people in hospitals in the UK treated the remains of children very badly, and the UK changed its laws because of that. And as a result, that material left the institution of the British Museum. So this wasn't an effect of somebody somewhere just deciding that they didn't particularly like their grandfather; this was a completely unintended consequence of a change in the law some 200 years after the initial collection. And that was why I was raising it: with these kinds of possibilities, your data in effect may not be very robust, and these primary source materials could kind of vanish at some point in the future. But was one, I was...
Yeah, that's one of the things. But anyone who deposits with us has to have consent from the people that they're recording with. One place where we observe the changes in legislation right in our faces is nudity and child pornography. Now imagine, you know, this is a village, this is a sign language village. They have hereditary deafness of 2%, everyone speaks a village sign language, and there are recordings of people, you know, signing away and speaking away, and there are 500 naked children running around. I have to restrict that, right? Because of the Americans. And actually, what's the probability that someone will now go into the archive, register, and then find child nudity, little children running around? And do we now need to go through every single recording and look whether there's, you know, a bare breast or a naked child? So we're in that realm too, and it was one of the discussions we had: what are we gonna do with this stuff? Any other questions?
When you're talking about consent in the original interview, are you also talking about copyright? Is there any dimension of copyright in terms of the restriction, or are you just talking about access restriction?
Access.
Okay.
Access. Nothing else.
So there's no copyright consideration?
No, no, no. It becomes tricky when we have songs, right? When it becomes music. Then it becomes really, yeah, exactly. And these are artists. So if we have, for example, a song-set collection from an Aboriginal tribe, recorded with ethnomusicologists, that becomes really, really tricky. But when we talk about consent, it's also a bizarre concept: you know, how do you get consent from someone who has never seen a computer?
And the internet.
Yeah, "here, I consent that this is..." So what the depositor has to have is this consent approved by the university. And there was a legally bizarre situation. We funded someone from Berkeley, and the research officer was very zealous. They had to sign the licensing agreement, and they said, we cannot sign that licensing agreement. The researcher is a PhD student and cannot sign the licensing agreement. We, the university, own the material. And I said, if you tell the Brazilian tribe or this North American tribe that you, Berkeley, own this material, there will not be any more linguistic research, because of the whole postcolonialism discourse. So that's when I said, okay, take the licensing agreement out of the contract and just send it to the PhD student themselves to sign, and we're gonna be fine. So we have to manage all these legally problematic situations, because no one knows what's actually happening, and there are a lot of bizarre conflicts coming up with this type of material. And music is tricky because that's copyright. So when you sign up with us in our archive, and this is why we call this public access, you have to sign up, and when you sign up, you abide by our Creative Commons, whatever, you know, non-commercial, only educational use, blah, blah, blah, blah. And we allow people, if they want, we would watermark the materials, just to... yes.
On a different topic, I was just wondering: when it comes to thinking about these processes of data management, file formats and, you know, the integrity of those, are there any conversations between your project and other projects doing similar work, i.e. curating and preserving endangered languages in the world of literature? Or are these processes sort of thought up within the individual projects?
So we founded, we have an organization called DELAMAN, which is for endangered languages archives. And it's really interesting, because in all the endangered languages archives, the archivists are all linguists who are archivists. And so we're trying to support each other on what formats are relevant, workflows, teaching materials and whatnot. But each of the archives sits in a particular institutional department, in a financial situation that determines what they can do and what they cannot do. When it comes to formats and whatnot, we look to the practices that are generally agreed upon by the archiving and library institutions. So we follow those guidelines on the bigger scale, and then we help each other on the smaller scale: what kind of software can be used to make use of the files that come out of a consumer camera, and all these kinds of things. And right now we're trying, for example, to invest together in an updated metadata editor, so that they have a very simple interface, because we hate Excel sheets. Too much freedom.
Okay, I think it's time to thank our speaker for this very good presentation.