Research Leadership Award project, a million pounds for this project investigating the unexplored side of multilingualism. And I have to confess that I feel like a total interloper here, because, A, I don't work on Asia or in Asia; B, I don't work on written texts at all but exclusively on speech, in this project at least; and C, we do not have data in one language, but we look at multilingualism in three village communities in this area, in southern Senegal. So here you have the Atlantic coast of West Africa, Gambia here, Guinea-Bissau, and we are here, and here with the red dotted line you see our main field area, three villages, and my personal field site is there.

This is what we have been doing for almost four years now; we are in our fifth year. We work in these three village communities that we call the Crossroads, and we look at the languages that are nominally associated with these villages, but also at how people actually speak, and they use an astonishing number of languages. So we have three senior researchers here, and then myself a little bit outside, and another researcher in Senegal on control languages, and then we have six PhD students, or we used to; one dropped out. They also look at the same area, and in the same area they look at language use in social networks, and women, children, code-mixing, deixis and other topics that I will not tell you anything about today. Rather, I'm going to morph myself into our workflow manager, Rachel Watson, who put together the bulk of what I'm going to tell you today, and tell you how our data management works.

You may already have gained the impression that this is a project that involves many people. We have the researchers here; some of them are based in London, some of them are based in Senegal, some are actually in the field, some are 450 kilometers away, so that brings with it communication problems. We have our research participants, the people whose language use we observe and record, and then we have a
local transcriber team, people who have been trained to transcribe our speech data, recorded in audio and video. And then some people have more than one function, so you see some lines here, because we have two corpus managers. It speaks for Rachel's self-effacing modesty that she actually omitted herself and her central role in the project, namely that of workflow manager, and she also omitted a very central person in our field base in rural Senegal, our transcription manager, and this is a role she created. In what follows I'll tell you a little bit about how these roles came to be created, what their functions are and why we need them, because, as you may guess from this slide, it's all going to be a lot about the human factor in data management.

So just a little bit about the type of data we have. We have linguistic elicitations, you know, where we sit a consultant in front of a mic and ask them for paradigms. We have interviews, we have psycholinguistic experiments, we have so-called staged communicative events. We do ethnographic and sociolinguistic participant observation, which we have recorded in field notes or in our brains. We have data where we leave the camera in a household, for instance for an afternoon, and just record all the interactions, and then later come and collect the camera and take whatever is on the card. We have our lavalier mic data, which we fondly like to call KGB data, where we observe the linguistic interactions of six focal agents, two per village. They get a clip-on mic that records all their interactions, when they choose to keep the mic switched on, for an entire day, and for each of these six focal agents we have collected two days' worth of language use. And we have photos; we're doing really badly with photos, so I will not talk about photos at all. And, yeah, questionnaires, sociolinguistic interviews. So a very complex, diverse set of data collected by many different people, and here I'm only going to talk about
what our corpus data are. So far, this is the state as of February this year: roughly 100 hours of transcribed and translated speech data, all tagged for participants and language as identified by our transcribers; you can ask me about that in the question time. So far we have identified 211 participants, and our transcribers have identified 20 languages spoken by people living in these three villages.

Just to give you an idea of what it looks like: this is the percentage of languages that are spoken in our corpus so far. You see, these three portions are the languages nominally associated with these three villages. Then you have Wolof, which is the major national lingua franca of Senegal, here, and, oops, then you have French, which is the official language of Senegal, here. Then you get all kinds of other languages here. You can only see 13; the others don't come out because they're fractional.

If you look at it per village, the language that is associated with Djibonker is here, and you see all kinds of other things. This is only lavalier mic data, okay, so day-long interactional data, with all kinds of other languages being used all the time. The next village, next door, looks really different: this is the patrimonial language, and it's by far not the most widely spoken language in interaction. And then in the other one we get the inverse distribution: this is the patrimonial language associated with that village, and it's the one that is used in the majority of contexts, and the others come in only in a minority.

So what do we do with these data? Yesterday you kind of had the end result of the process of what we are doing with our data, when Mandana Seyfeddinipur gave a talk on data management and the Endangered Languages Archive. We use the tools that are also used by the Endangered Languages Archive. So here you see Arbil, which is a corpus management tool, and for every single recording we create a session, which is a resource
bundle that consists of the media file, all annotations, translations and, crucially, metadata. So here you can see our session; this is the name of the session, and all associated files will have an identical name. We have very strict conventions for file naming that I can explain to you in detail later if you're interested. Then we have actors that are defined, so you can just drag and drop them to your session if you know them already; if not, it's again a regulated process to add actors. And we tag for languages and for all kinds of other things. This is how, ideally, all media files should be managed in our local corpus and then uploaded to an online archive.

And this is what the data we mainly work with, the corpus data, look like in terms of what information we have on what we work with. This is a screenshot of a software called ELAN, which is a multimedia annotation tool that our transcribers use to transcribe and translate, and then identify participants and tag for languages. So you have the original transcription and the translation into French. The orthography is our transcribers'; we don't interfere with that at all at this level. And you see, in this small stretch you already have one, two, three languages, okay?

Then we have an intermediary step before we can actually have all our files in our online corpus: we also have a kind of working corpus on two different network drives that are regularly backed up, and you can see our file structure there. That's the intermediary one, where everybody also knows where to find things. And this is how we look in the archive.

So far so good, and so easy, right? Because it's dead easy; we all know data management is straightforward: you collect the data, you create the metadata and annotations, you deposit in the corpus, you analyze, the end. Well, in our case, very often we don't know the people we record, because we record spontaneous interactions; we don't
just sit people who we know in front of a mic. So very often it's impossible for us to identify the participants, especially with the lavalier mic data, when people walk around in our absence, and very often the transcribers don't know all the people either. So it's actually quite a time-intensive intelligence-gathering process sometimes to find out who is speaking, and we cannot identify everybody. But of course we are going to try very hard, because we are trying to understand multilingual language use in interaction, language as a social practice, so we need to know a lot about the participants. Very often we don't know the places. We cannot identify a fraction of the languages, and not even our transcribers can transcribe or understand all of them.

For the metadata and annotations, early on in the project we really hit a dead end. That was terrible, because all these tools are designed to work online, and we had planned for internet access in our field base, and Senegal telecom never laid the cable that would have allowed us to have internet access. So all our work in Senegal is offline. That creates a considerable time lag between different steps in the workflow, and we all know what happens if you have a lag: things fall through the cracks, and it makes the interfaces less smooth.

And finally, we are 12 researchers plus five transcribers. All our collaboration is based on a certain degree of standardization and harmonization; unless that is created, everybody can only work with their own data, and we cannot actually create a corpus that is usable for everybody and accessible for people who don't know our particular environment. So these were the challenges we were confronted with, and now I'm going to tell you a little bit about how we confronted them.

So this is kind of how it was initially. We had the researchers, and they collected, and we know researchers really like to collect data a lot, and they also like to give them to other people
to work with them. So researchers would give files to transcribe to the transcribers in the base. We already had somebody in a kind of administrative position, but he was not explicitly responsible for managing the transcription pipeline, okay, so people would sidestep him. Many of the transcribers are kind of nominally attached to some researchers who had worked there previously, and everybody, of course, always wants to have their own data transcribed, right, because you always need them next week. And so we had these constant problems: our transcribers were saying, we need a pay rise, we're working overtime, because people give us these files to transcribe, and we kept saying, but we have a plan, what's going wrong?

Similarly, when the files were transcribed, it was very difficult to track where in the pipeline a file was, because sometimes the transcribers would give it back to our administrator, sometimes they would give them back to the researchers on a memory stick, on a hard drive, or as a WhatsApp attachment when they had a 3G network. Sometimes files vanished; nobody knew which one was the good version. So it was a big mess. And similarly with the corpus, because people came back to London and of course would always know where their files were, but somehow our beautifully designed plans at the beginning, about creating this overarching architecture, never really fell into place.

And so we decided to have a workflow and a workflow master. From my personal perspective, that was maybe the most important experience that I had in leading this whole project: that it's really important to find good people to actually fulfill particular roles. This was a little bit self-selective as well, because you need to have a certain mindset, great team spirit and a sense of responsibility to take on such a role. So luckily we had Rachel, and she got really interested in principles of workflow
design. What she did was really nice, because she started looking at the points where we encountered problems. Instead of just sending out these emails, you know, "we have a policy, please do this, this, this, this", that make people feel defensive and stop reading these emails, she really started investigating why certain things do not happen at certain times. Then she started reading up on workflow designs which take this human factor into account, and came up with these principles.

So: let the data management flow be dictated by the natural cycle of research, and not the other way around. Our initial workflow, which is also the one that is always recommended in data and language documentation, was too simple and did not do justice to our complex time, yeah, chronotopes basically, so we needed one that took account of that. Then, where a task must be carried out in a very specific way by multiple team members, we need to have documents detailing the process in explicit detail; but also, if something does not need to be spelled out in detail, then don't be prescriptive, and let people just do what they want to do. Try to apportion tasks according to knowledge and expertise; but also, some things just need to be done, there's no shying away from it, and that needs to be enforced. And finally, you always need somebody who is looking after these things and oversees the entire process.

So this is how our overall workflow was then simplified and strategized. The researchers, rather than going to individual transcribers, had to go through our transcription master in our field base, who has a master list, who is the only one who can give files to transcribe to transcribers, and who is the only one who gets them back and transfers them to the researchers. Then he... oh yes, not to the researchers, exactly, that's also important. So, you know, again, this chaos of "where are my files", and also
files can belong to more than one person, has been eliminated, because he only talks to the corpus master in London, who has a spreadsheet. She logs all the stages, and also what is missing and needs to be done, and she then talks to the researchers if something's missing, or gives them transcriptions so that they can work with them and unify session bundles that have something missing, et cetera. And then she oversees that everything is uploaded into the corpus with all the elements that are required.

So this is the spreadsheet she has. You can see here the names for these sessions, and then: does it have a transcript, an annotation file in ELAN? Does it have a WAV file? Does it have an mp4 file? Is it transcribed? Participants, really important: you see, we have 211 participants, so how do we recognize them? Well, we have a participant master list where every participant gets a unique code, and again a participant master. So whenever somebody encounters a participant that is not yet known in our Arbil favourites, they have to go to her and she assigns the code, et cetera. Is the language identified? And IMDI, which means: is there a metadata file?

I can actually pass these around; these are some explicit workflows for you to circulate and have a look at. You can also see them on our website, which is soascrossroads.org. Rachel has written a blog post on the workflow, and you can download all of these there if you want. Here I'm only going to show you some examples of particular critical points and what is specified there. Note also that these workflows are not technical help files. We have separate technical help files: if you don't know how to convert an uncompressed audio file into an mp3, or an uncompressed media file into an mp4, we have a file with screenshots and step-by-step instructions for that. So we try not to overload these; every single thing is broken down into steps that are as simple as possible. So this is kind of the
first step, when you have a recording that you need to create a session bundle for in Arbil. What happens here is: we have a set of minimal metadata that we need to know. You need to have the name; you need to know where it's been recorded, when, and by whom. A lot of this is already present in the name, which is very mnemonic. But you also need to know who is speaking, and ideally you want to have the languages. So this can be nicely broken down. Do you know the participants? Yes? Okay, then go to this step: do they have a code? And then you decide yes or no. Okay, if they don't have a code... and how do you know? You check this document, and it tells you where this document is located as well, on our shared drive.

But here, this is the crucial step. This is the kind of thing where something is not done at a particular point, and if you then don't take care of doing it at a later point, when you have the necessary information, it just never happens. So: do you know the participant? No? Then the next question for you to consider is, why not? First of all, not noted by the researcher? Well, then that's the point where you go back and do it. But if it's from the lavalier mic data, where you cannot know them, right, then you wait for the transcription and you do it then. And for that we then have our corpus master, who has the spreadsheet and sees that this information is missing, to remind you.

Another short example, from the second workflow sheet, where you prepare files for transcription. So here it says "prepare files for transcription". The first step is that you convert all your files to WAV and mp4, so that the transcribers do not need to work with these huge uncompressed files, and if you don't know how to do that, we have a sheet to help you, but it's not on here. And then you go on, and again it's broken down step by step.

Another example. So you have a recording, you have created the first
metadata, and now you have a transcription, right? So the transcribers have transcribed and identified the participants and languages in ELAN. You get this and you check what happened: you open the file, you check the participants in the annotation, and you see, are there gaps? There are always gaps, okay; the other case, that they're all identified, never happens. So, no? Well, email the transcription master, attach the ELAN file and request the identification of participants.

Note that this is also something we had to hone regarding file transfer, because we don't have internet in our area, but what we do have is a good mobile network, a 3G network. So we cannot transfer media files at all, or upload them, but the transcription files are very small, the size of text files, so we can transfer these via WhatsApp. That is actually very helpful: we can send a transcription file to our transcription master in our field base, and they can check, and they are there, you know; they can even ask the people who participated in the recording to identify the unidentified participants.

Okay, how much time do I have left? Okay. So this is just to give you an example of a point where you cannot ask anybody else; when you get to this point, you need to do something. This is when you move files into the corpus. There are other things that I'm not going to show you here, but if you get to the point where you have to bring together all your resources into that resource bundle, and you need to make sure that everything is there and that nothing is missing, and there is something missing, like the relevant media files, which happens a lot, well, then there's nothing to be done and you cannot delay it: you must go and find them.

I think what I wanted to tell you is how important it actually is to take the human factor into account when you plan and manage such a
distributed project, and also that it would be really nice if we had a better system, for instance for managing metadata. I was looking in awe at the c-text program and other projects that have so many automated steps in their workflow; clearly we could do better with that. What I wasn't talking about is how we can integrate sociolinguistic data. That is something we really haven't looked at in our project; it's part of the individual research projects, everybody manages their individual data, and we have not even attempted to enrich our corpus with that. That would be an important next step for us. And, yeah, automation of certain processes, and reliable online access, and maybe something that is under development in many language documentation projects: if you can't really have online access, then try to have an app that helps you to collect immediate metadata, and maybe even one that automatically collects GIS data, so that you do not have to collect that separately. But that's really the next step. So for now, I just thank you.

Listening to it, what it sounds like is that the problems you're trying to solve are not really about digital anything. They're about the nature of how you handle teams, right? They're about understanding what people's possibilities are and how people respond to their roles within teams. So they'd apply even if you were in a pre-digital age, even if you'd been recording these on...

Yeah, yeah, but I think 90% of problems in digital data management are not about the digital at all. There are loads of problems about the digital that I did not even touch upon, because I hope somebody else will take care of them, like the regular migration of our data formats, for instance. I mean, we have funding for five years; nobody will look after that if the archive cannot find a good way of doing that. But yeah, as PI on the project we're
on these sort of issues could have been looked at earlier than they were, and I say that to john watch out this is, I hope you are paying attention. Do you think that now... do you think that I'm paying... now, I'm not saying you're paying this, no, I'm not saying that, I'm saying that... I'm saying you will, I'm saying you're adding, you know, the next three months or something, all the indications I knew, will make those mistakes.

Okay, but I'm sure that you tinker with the workflow constantly.

Constantly. It really needs looking after, like a potted plant.

Yeah, yeah, but the first problem is knowing that you need one.

Knowing that you need one, and that you need to look after it. So you kind of need to acknowledge that that is important. In my experience, and this is my third language documentation project, we are not trained for that, we are not prepared for that, and there are no easy templates for this, so we really need somebody to take that on.

Yeah, the way we look at it is, the systems aren't the workflow. Yeah, no, they aren't; they support the work.

Yeah, exactly. So the workflow is the human factor, yeah, and the human knowledge. And the systems are always lossy, they are, and they can actually be real obstacles. And it's hard to build good systems that scale, and the systems in digital humanities are often very clunky and very, very frightening, especially for humanities researchers who are not dedicated digital humanities people. So it takes a lot of hand-holding and determination to overcome that, in my experience as well.
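
The tracking spreadsheet and the participant master list described in the talk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual tooling: the column names, the `SessionRow` and `ParticipantMasterList` classes, the session name and the `P001`-style code format are all hypothetical, chosen to mirror the checks the speaker lists (annotation file, WAV, mp4, transcription, participants, languages, metadata file).

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical column names, modelled on the checks listed in the talk.
CHECKS = (
    "eaf",                   # ELAN annotation file present
    "wav",                   # WAV audio present
    "mp4",                   # mp4 video present
    "transcribed",           # transcription finished
    "participants_coded",    # every speaker has a master-list code
    "languages_identified",  # languages tagged
    "metadata",              # metadata file present
)


@dataclass
class SessionRow:
    """One spreadsheet row: a session bundle and what it already contains."""
    name: str
    eaf: bool = False
    wav: bool = False
    mp4: bool = False
    transcribed: bool = False
    participants_coded: bool = False
    languages_identified: bool = False
    metadata: bool = False

    def missing(self) -> list[str]:
        """Return the checks that still fail, in spreadsheet column order."""
        return [check for check in CHECKS if not getattr(self, check)]

    def ready_for_upload(self) -> bool:
        """A bundle is complete only when nothing is missing."""
        return not self.missing()


class ParticipantMasterList:
    """Single authority that hands out one unique code per participant."""

    def __init__(self) -> None:
        self._codes: dict[str, str] = {}

    def code_for(self, person: str) -> str:
        # Known participants keep their code; new ones get the next number.
        if person not in self._codes:
            self._codes[person] = f"P{len(self._codes) + 1:03d}"  # assumed format
        return self._codes[person]


if __name__ == "__main__":
    row = SessionRow(name="EXAMPLE_SESSION_01", wav=True, mp4=True)
    print(row.name, "still needs:", ", ".join(row.missing()))
```

The point of keeping both structures in one place, with one person responsible for each, is the one made in the talk: a gap that is not recorded when it arises tends never to be filled later.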