So I think you've all met Kaki before. She's a visual content curator in ELA, and she's very kindly offered to talk to us about practices, recommendations and explorations for archiving. Take it away, Kaki.
Do you need to do the summons before that, or are you going to...? No, no, we've got a class after this. Oh, all right, fine. They couldn't get enough field methods this week, so... Oh, wonderful. Six hours.
So, except for Serge, the rest of you are field methods students, is that right? So were you all there when I did the lecture in Julia's class? No. How many of you weren't? Okay. ALDD. Yeah, ALDD, yes. Yeah. Maybe he wasn't there. Maybe he was there, but his mind wasn't there. His mind wasn't there.
Okay, because I didn't know how many of you would have been there, I've prepared some of these slides, and I'll skim through them very quickly because it's stuff that I've already talked about. What I was hoping to cover today were things like what an archive is, which I might have talked about before; the relationships between archiving and languages and language data, in particular endangered language data, which might be of interest to some of you; and a little bit about the research data life cycle, and how important it is, when we talk about archiving and best archiving practices, to make sure that we have good data management practices in place early in the whole process. And finally, I will be going through advice that perhaps Peter has already talked about, but trying to shape it with respect to archiving and preservation recommendations, and giving you a list of resources. And by all means, you know where we live in the archive; feel free to come in and ask for more.
So you've seen these pictures before. Can I start by asking what, after having had an interaction with the archive, you think an archive is and what you think it does? Paul knew much quite. You know, you would have a quote you like.
It's a cemetery. Graveyard. It's a graveyard where things live. Or, like, die. No, they don't die, they live. They're buried there to carry on living.
So all of these things can be said of an archive. An archive does things like making sure that we keep data safe. We store it for the future; preservation, we call that. We provide facilities for people to access and search that data; we call that discovery. And, in particular in relation to endangered language documentation, linguistic documentation and cultural data, we make sure that the wishes of the communities involved and the depositors' wishes are respected when it comes to access. Some archives also have a role or an aim to provide a platform for sharing data, and work as a facilitator between what the users do and would like to access and what the producers or depositors are producing in terms of data.
You've seen that before? So if these are all the archives we know, that includes not only digital archives. What other archives could we have, apart from digital? Yeah, books, papers. Yes, paper archives, that's very good. Digital archives are a subset of that, and an even smaller subset are the digital language archives. For digital archives you can have other types of data, like statistical data, or government data, or economic data. And an even smaller subset of that are the indigenous languages archives, or the language archives that specialize in endangered languages. So I'll just skim through this.
And here's how they work; you might have seen that graph before. We have producers, data producers, people who go out to the field and record data. They produce the data and they give it to the archive in some form or another.
We will talk about what would be a recommended form for the data that comes into the archive. So the archive takes it and does some work on it. We will talk in more detail today about what the archive does. It links things, or it takes people's already created relationships and includes them in its system. Then it makes that available: it sends files, as well as the relationships, as well as any added value it has created while working on that data, and disseminates them to the users, the users being the people who would like to look at the data, access it, do something with it, or work further on it. Before I move on, any questions about this picture? So far so good.
In addition to the services, advice and information I gave on the previous slide, a language archive can offer help with language revitalization. It's a very important role for an archive that has such material. The data can be used to create usable language material for communities, either by the archive or by other researchers who are accessing the archive. Another thing that language archives offer is quality and standards: they check that the data are properly documented, so not just any material can enter the archive, and that they're consistent and have as much information as possible. So they negotiate with depositors to make sure that what comes in and what goes out is in its best shape and form. And advice and training: the archive advises depositors, like I do today with you, future documenters, as well as elsewhere on other platforms, organizing workshops throughout the creation of data, on issues around archiving and how people can reuse their data.
You are a digital generation, correct me if I'm wrong. Why do you think it's important that documentation data nowadays is digital, or why do you think it's not important?
Why do you think it's useful or not useful? Accessibility is one, yes. So anyone can listen to an MP3 or a WAV file on any machine; it could be on a mobile phone, a laptop or a tablet. Anything else? That's a very good point. It's much easier to migrate into newer formats as well, because a format might become obsolete, but it is much easier, as you say, to move it into future formats.
So here are some of my thoughts. It's one of the best ways we have for media to be preserved for the future. It can be copied and transmitted with zero loss. It's much, much easier to catalogue, share and disseminate, as well as to store: imagine a huge library with loads of manuscripts and books, and how many bookshelves and how much space you need to store all these manuscripts, and now think of the equivalent in the digital world and how much less space it occupies. And, as was mentioned, you can access anything, anywhere, anytime, on anything.
Users of language archives are, of course, the communities; depositors, people like yourselves in the future, creating data and wanting to go back and access more, or add to it because they might have gone to the field again and done a second round; other researchers, comparative or historical linguists, again people like yourselves, coming from the researcher's or the user's point of view; and other people like educationalists, journalists or the general public.
And why do you think language archiving is different? Having worked with Peter throughout your field methods course, and having looked at some of the audio training and, hopefully, some of the material that the archive has, why would you think that language archiving is different? What are the issues that are special to what we are doing? Yes.
It's very important for language archiving to have metadata accompanying any recording, because otherwise it might be very difficult to establish and actually understand the value of the particular recording and of each piece of audio or video material. That's a very good point, but do you think that's specific to endangered language material? I think it's more specific to endangered languages than to any other kind of data or any other kind of archive. I mean, it's important to have metadata accompanying any kind of data, but especially when it comes to endangered languages, where we might never get a chance to record that particular language ever again.
That's a very good point. Going to the field and recording a speaker who might be the last speaker of that language carries a lot of responsibility for whoever goes there, both for the depositor and for the archive, because that material is somebody's heritage, somebody's life story. It's not numbers. Other thoughts: what do we mean by language? It's something... It's not something, but it's words that somebody speaks. Are they a language? Are they a dialect? Are they just words? What is the context? What is the story they carry with them?
Another issue, and another interesting thing, is that data is not just TXT files, so you have a greater variety of formats that you need to know how to work with. Your colleagues in the economics department or in the government department will probably not look at video data or audio data, so they will probably not worry at all about how to handle it. But you will have to record audio and video as part of your documentation, and do the annotation. You'll have a much greater variety of data to work with, so you will need the skills to be able to do that.
Of course, large file sizes, and the greater diversity of data formats, as we mentioned before, and the sensitivities and restrictions that endangered language documentation carries, and of course, I would say, extremely high priority.
I think there is something special about endangered languages, in that they are typically having their domains of use restricted, often to very domestic, more personal kinds of context, and that ties in. So sensitivities might be cultural restrictions, but there are also sensitivities that come about, in a sense, because of the nature of the language's domains contracting: the kind of content that you are going to be getting, not just the form but also the content, is particular and has to be dealt with in a particular fashion.
Absolutely, absolutely. And the word sensitivity is a very generic term as well, so depending on the culture or the community or the individuals you work with, you will come across different experiences and different perceptions of what sensitivity is. That links very much to issues around consent, informed consent, and access: who can access and what they can access.
So another kind of point, which is often the number of speakers, or the number of people who are competent in the language, which is very different from what you get in a bigger language. What that means, if we are talking about preserving materials so that other people can use them: if you preserve material in English or Polish or Russian, then anybody who speaks English or Polish or Russian could potentially access it. But if the recordings are only in a language that a handful of people can translate and transcribe, then they have a very different status, one might say, from a conversation that was recorded in Russian. So I think that for endangered language materials the notion of preservation is a very different one from when what you're preserving is something that anybody would be able to annotate.
I'll add it to the list; that's a very good point, thank you. And it also makes the metadata business really important, much more important than, you know, everything else. Any more thoughts or comments on this? Yes?
That's a very good point. So you have to work with whatever you have, really; you don't have the luxury, let's say, to look further. So if it's the last few remaining speakers, you need to make the most out of what you have in front of you, what you can work with. Wonderful point, thank you. Anything else?
I'll just skim through this; you can have the slides if you want to look at what other archives are doing. I've just given you some examples of a variety of different types of archives: you can have local as well as international, you can have archives attached to a research institute, you can have physical versus digital versus mixed, and so on and so forth. I'll skim through that as well, because you have probably seen it either in Peter's or in other lectures. I had it there because I didn't know how much you had covered, but you can have a look at these pictures. These are some of our data providers: speech communities, community members, or whole communities.
And this is who we are, which I will also skim through, because I'm sure you've already heard about who we are and what we do. We are a digital endangered languages archive, in case you're still in doubt about that, and we do all the things we mentioned a while ago. Here is the breakdown of the types of data we have, and that's what our portal looks like. The section that would be very useful for my talk today is the section called depositing: all the information I'm giving you in bullet point form is there as whole pages explaining our recommendations to our depositors. And you've seen that, and you've seen that, and you've hopefully all registered by now. If not, I've made some printouts of the guide if you need one.
And you have heard about our protocol. Would it be worth asking what each of the letters means, because we've spoken about this before? So what does U mean? User. All users? Yes, that's very close. Universal? It's all registered users: it's universal, but for registered users. Yeah. What about R? Researcher. Yes, people who have registered and who we know are researchers, whether they're independent researchers or institution-approved researchers. Do you have to give approval to them for that status? So they don't get that automatically; they have to apply for it and we check. C? Yes, somebody said it: members of
the community, the speech community, that the depositor has individually approved. So they have to apply to the depositor and the depositor will approve them. Which is our last category, subscriber, which means registered users who have applied for access to that particular deposit and whom the depositor has approved.
And we go to the important bits for today. As I said, all the good data management issues and practices that Peter has talked about in the past are also vital for good archiving practice. If you follow good data management practice, then your data will be very easy to archive at the end of it. So, for example, things you should do as part of good data management during your project would be to document decisions. Why did you use that file naming and not the other? Have you changed it? Say somewhere that today, the 29th, I've decided to change my file naming system as follows, and I've only changed this folder and not everything else, because you probably will not remember it in two or three months' time. Document steps; document your conventions when you're annotating things; document your structures, your encodings, and the formats of your video or audio files. Make sure you are following appropriate and conventional data encoding methods. That might sound like Greek to you, well, not to some of you, because some of you know Greek, but it actually means: make sure that your characters are Unicode, a particular format that can be read forever. This is like the preservation format. Be explicit and consistent; consistency is very important. Plan what you're going to do: how you're going to export a file from an editor or from software, and how you're going to put it into another. And, as I said, good data management practice means good and easier archiving in the future.
What is data management? You've covered that, haven't you? But can somebody remind me what data management is? What is data? What
do we mean by data? It's fine, just like the idea of flowing; what is an opinion? What could data be? Let's not make the question... come on, come on. So data is information; that's a good definition. Other ideas about what data is? It might help if you talk about data in terms of its purposes; perhaps that might be easier to define. Yes, these are all types of data. Yes, that's wonderful, wonderful.
So what does data management mean? Come on. Making sure that it's used and recorded and stored efficiently. Good, well done. Somebody else? I'm going to start shouting now because you're falling asleep. Anything else? When you work with data, like when you went for your audio training, right, you went out, you grabbed a recorder and did recordings. What were the things that went wrong? Everything went perfectly fine? Everything went wrong? What about these? Yes, Nana. Data management included noting down the name of the recording, the time and other things, identifying the files and the folders that you put them in, and then trying to be consistent with putting them in a particular place. Wonderful. So making sure that the links are there, that you have metadata, and that you've recorded the right information about what you've done to the files.
So that's an instance of data management. It actually covers everything from the moment you grab your recorder and create a file to the point where you do what you want to do with it: edit it, annotate it, and leave it on one side to rest before you archive it. It is something that you have probably done without even knowing. For example, when you create a document file for your assignments, how do you do that? Do you work on the same copy throughout the three weeks before the assignment, or do you make multiple copies that you name in different ways? So how do you work when you work with a document like that? Okay, the recent assignment that you did for Peter: how many different versions of
that file do you have? One. One. Two, good. Two, stored on the same computer? No, on another computer and on a pen drive. And the pen drive, good, I like that. Anything else? Computer and Dropbox, that's good. Did you print a copy out? You submitted it. You submitted one, okay.
So these are all instances of data management. It doesn't have to apply to an audio or video file. You're doing it every day: when you take a WAV file of your favorite music and convert it to an MP3 so that it's light and you can play it on your MP3 player, that's an instance of data management, because you're converting to alternative formats. I'm just trying to show you how often you're actually doing it, because we're dealing with digital data every day without even noticing.
For those of you who are interested, and perhaps Serge might be interested in the last bullet point: a data management plan is an outline of the good data management practices you promise to follow throughout your project. This might not be a requirement for you guys for your assignment, and it might not be for you either, but it's a good resource to go and have a look at, because it gives you what large grant funding agencies, like the ESRC, Research Councils UK, and the NSF in the States, recommend as good data management practice. It's a questionnaire that you have to answer about your project, so it's a good place to go and see whether you've thought about all the different aspects of data management. And these guys are experts; they've got very large archives, so they update these data management plans quite often. This is exactly what we're doing, straight up. Oh, right, okay. We've discovered now that we've got files all over the place; we actually need them. So that link could be useful, or not.
So you should worry about managing your data because it makes your own research easier; it protects your valuable data; better-quality research data, without any loss, means better research; and it enhances your research
visibility, which means that if your data is out there and it's accessible in a nice format, people have more chance of using it than if it was in a proprietary format no one could open. It ensures compliance with ethical codes, data protection and other laws. And, of course, easier archiving is at the end of all this. The most important point out of all these is the first one. You need to remember that the data you create, especially when it comes to endangered language documentation, will hopefully have a much longer lifespan than yourselves, and they're likely to be used by community members in the future to look at what the language sounded like or what the rituals were like in the past. So think of it in that context as well while you're doing all that hard work. It might sound or seem tedious, but if you think of it in that larger context it makes it all worthwhile.
And you've probably heard about all these things already. Things to consider would be: what type of data you will generate during your research; ethical and legal issues; copyright; data storage; backup, which is very important; consistent file and folder management; and, in the case where you have more than one researcher, how do you manage that? Quite a few of you... So what I would like, not just you guys but everybody: I have a question at the end of the whole thing, which is how many of the things we're going to talk about today apply, or are the same, or are different, when it comes to the context of working with a larger group of people. I'll ask it at the end. All of you have had some sort of interesting group work, so keep it as a question in the back of your mind while we're talking about all these issues, and see: are all these issues common? Are there any differences? Are there any additional issues that perhaps you've come across and would like to ask me about?
And they know about that already, right? So I will not go on about it. So from the point of creating
data, what are the steps? You process them, you analyse them, you preserve them, you give access to somebody or, hopefully, more people, somebody uses them and then creates some new data, and then the cycle starts over again. I'll skim through all this information; you'll have the slides if you need to have a look at them. It's easier that way.
So when preparing your documentation for archiving, here are the things I'm going to go into in further detail. We'll talk about file naming and folders with respect to archiving now, because, as we said, you've already received some guidance on data management. So what do you remember from your data management class when it comes to file naming and organisation? What were the good things to do? What were the bad things to do? Being consistent is very important, yes. Trying to put too much metadata into the file name, yes, it's not a good thing. That's very, very important. So consistency and documentation are, to me, the topmost things to do, and then the semantics is the third in importance. Anything else?
So have you seen that example before? You need to be careful about what characters are in it. Yes, yes, that's very correct. That goes back to what we talked about, non-UTF-8 characters: a script that you can see on your computer, like Chinese characters or Tibetan characters or Greek characters, that does not actually show up properly in other systems. Okay. No, it's fine: dashes and underscores are fine. Spaces are not very good practice, especially double spaces, because a single space does not confuse a computer or a system that much, but with a double space, some systems cannot understand two of them, so they will interpret them as one. Then there is a file with two spaces in it that you cannot open or do anything with; you might be able to rename it, but not necessarily.
So, just by having a look at this file, what information can you gather? Yes, that bit, perhaps. Yeah.
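The points just made about characters and spaces in file names can be checked mechanically. The following is a minimal sketch in Python, not a tool the archive is known to use: the function name `filename_problems` and the example file names are hypothetical, and treating every non-ASCII character as a problem is a deliberately conservative assumption.

```python
def filename_problems(filename):
    """Return a list of naming problems found in a single file name."""
    problems = []
    # Double spaces are singled out because some systems collapse them into one.
    if "  " in filename:
        problems.append("double space")
    elif " " in filename:
        problems.append("space")
    # Conservative assumption: flag any character outside plain ASCII,
    # since other scripts may not display properly on every system.
    if not filename.isascii():
        problems.append("non-ASCII character")
    return problems


# Hypothetical example names, for illustration only.
for name in ["story_01.wav", "my story.wav", "my  story.wav", "ιστορία_01.wav"]:
    print(name, filename_problems(name))
```

A check like this only reports problems; deciding how to rename a file, and writing that decision down, is still up to the person managing the data.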
Yeah, yep, that's from this one. I guess it's an audio file. It's an audio file; you can tell from the WAV extension. Uncompressed? Yeah, well, you wouldn't know, because maybe somebody has renamed it, but chances are it probably is. Sorry, that's where the archivist kicks in. But what about this one? Could this be...? It could be a language. Yes, it could be a language. We don't know. It could be a language consultant's name, initials, or a linguist's name. Okay, we don't know.
So for this particular example, that's the convention that the person followed. But you realise how important it is to have that written somewhere, because chances are a member of your team won't know the conventions you used. The speaker ID was formulated in that particular example by taking the initials of the speaker. But that's true, so maybe it's BE from Betsy and S from Spinoza, maybe. I don't know; I stole that example from somebody else, from David's slides. Okay, any questions on that? You are happy with this? You are familiar with that?
So all the guidance you've received for data management of your files applies to archiving files. Document your standards and formats. Wherever possible, if you have semantic information in the file name, put it in the same order wherever you use it, and keep names simple and consistent; it's very, very important. Do not use spaces; use dashes or underscores. UTF-8 is important; some people might say ASCII only, because you wouldn't know if it's UTF-8 or not, but UTF-8 is good. Make sure all files have an extension. We get loads of files in the archive that don't have an extension, especially from Mac users, and it's just a nightmare, because you have to rename them and try to find out what kind of file each one is, or run it through some software to tell you what the file type is. Avoid double dots. And do not forget to record information about the files and their relationships. Some examples of
different folders, just to show you what depositors are following in terms of their folder structure. Any of this is possible, so long as it's consistent, documented, and follows the rules for file naming that we mentioned before. So here it is. This person, what have they done? Can you see the letters? They have their ELAN files together and the media files together. Their ELAN files look like this; their media files look like this. So what they've done is name the related files with the same file name. Another example: this person has chosen to group things by session, so within the same folder they've put all the related information, the picture, the audio file, the annotation and any preference files. Another example is just listing everything in the same folder, so no folder structure at all; they've used the file naming to group things, and then the metadata will tell us what links with what, plus additional information. And here's another example where the same collection had loads of languages: they first grouped everything by language, and then they've given us the dictionary. Right, any questions?
I'll move on quickly. I don't know if you've come across any of this software; we use it in the archive to manipulate data, file names, folders and structures, so I've included it there just in case at some point you need to do the same, as possible software to use. I'm not affiliated with any of these people or any of this software; these are just the ones that I've found easy to use, and I'm telling you what they are.
The other point, especially when it comes to software: say that when you start your project you create documents with the .doc extension, and in five years' time Microsoft decides, I'm not going to use the .doc extension anymore, I'm going to use the .docx extension. So make sure that you keep up with the latest updates or versions of the software, and make sure that you migrate to the
appropriate format, or take my advice and convert everything to preservation-friendly formats; then you won't have any trouble.
Have you come across this distinction between the different types of file formats that you will have during your project life cycle? You'll start with the original; you'll choose some derivative or working format of that, either because you need to edit it, cut it, trim it, or group it with something else; and then you'll have the final copy, which will be the one ready for preservation. Are you familiar with these terms, or should I go into more detail? Do you want to go over them? Okay.
So when you did your audio training, you created an audio file, and that was just what the machine named as sequential .wav files. That's your original: the native format as you get it from the recording device. Now, I don't know if you did anything with these files, but if you were doing this as fieldwork documentation, the idea would be that you would produce different versions of it. You would select a part of it to include in your analysis, or you would group something with something else and present it as one file to annotate further. So that's your working version. You can have more than one working version of the same file, but you will only have one original. You need to make sure you keep the original and are able to go back to it and see what happened. For your working versions, also make sure you keep the one that looks best: if you were working on trimming it, have the trimmed version kept safe somewhere, and then carry on doing more work on it. And finally, the preservation copy is the copy that the archive will usually get: it will be the best version of your working copy, converted into a preservation-friendly format. I'm going to talk about preservation-friendly formats in a minute. Yeah?
You said the native format was extracted from the recording device, but that doesn't necessarily mean the name assigned by the recording device, no?
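The rule of keeping one untouched original and doing all further work on copies can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_working_copy`, the folder layout and the CSV log are hypothetical, not a workflow prescribed in the lecture.

```python
import csv
import shutil
from pathlib import Path


def make_working_copy(original, working_dir, new_name, log_path):
    """Copy an original into a working folder under a new name and log the link."""
    original = Path(original)
    working_dir = Path(working_dir)
    working_dir.mkdir(parents=True, exist_ok=True)
    target = working_dir / new_name
    if target.exists():
        # Never silently overwrite an existing working version.
        raise FileExistsError(f"{target} already exists")
    # The original file is only read, never moved or modified.
    shutil.copy2(original, target)
    # Record which working file came from which original (hypothetical CSV log).
    with Path(log_path).open("a", newline="", encoding="utf-8") as log:
        csv.writer(log).writerow([original.name, new_name])
    return target
```

Editing, trimming and annotation would then happen only on the file returned by the function, while the log keeps the link back to the original.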
You could still have an original that you had renamed? Yes. It will depend on how you've set things up and how you would like to work with your data. You might choose to back up the original somewhere safe as an original, because for some reason you wanted to have a record, or you wanted to have the files as they were, without being renamed. Equally, you could rename it and then make a note of the association: so 201.wav was renamed to 2013 1129 kc 01.wav and was placed in that folder. So you can keep track of your changes in different ways, depending on how your workflow is set up.
So after this, let's think about what we are doing in the field. Now, do you have any questions about the three different types?
Do you expect the files that you get in the archive to already be in preservation format? When we do it... you don't? That's my question. And another question: what percentage of what you get is actually in that format?
I wouldn't be able to report on the percentage. Depending on the type of file, we have different policies. So for your text files, we normally keep the original that is given to us, and we make a preservation copy for those that are not in the right format; I'm going to go into detail as to what we do for each format. Now, I don't have stats about what percentage of the data comes in that way. We get people who are very well informed and have worked really hard on the collection, and have got the original as well as the preservation copy, in which case that means very little work for us on that front. But equally we get people who just give us the originals and say, here it is, and we work on it and produce the preservation format. I can work on it and report back, but I don't have the stats in my mind today, I'm afraid. But it's an interesting question.
So this is an overview of the different types of data that we usually come across as an archive. We have audio files, of course, video files, text
files, annotations in various software formats (most of the linguistic packages are good in that they have an HTML export option or are already preservation-friendly, so that is good in a way), images, and structured data such as databases, spreadsheets and other relational data. So for audio files, it's pretty stable in the field (not your field, the archiving field) that the preservation-friendly format, the best format, the recommended format for audio files is a WAV file: a full-resolution WAV file, so lossless, recorded with settings of 44.1 kHz (you've heard about that before), 16 bit, stereo. Unless there is a special reason why it shouldn't be stereo and should be mono. Some people do specialised recordings of process information where they have to use mono, or they have to use two channels to record something. So unless there is a particular reason, it's stereo. I think, is that right? Is there any reason for turning a mono channel into stereo? Do you remember what Tom mentioned in the audio training? I think maybe he talked about that slightly when we were listening to the mono and stereo. He was saying that if the mono is the original, then we might want to input it into Audacity and produce a stereo for the sake of our listeners, listening on both ears rather than on one ear. I don't know; I think with mono files, when you play them, you will hear them on one ear. Is that what you are asking? Can you ask the question again? Ask the first question. Yes, please. For archiving purposes there is no reason to make a stereo: if your original is a mono, we don't recommend that you turn it into a stereo. So that's that part of the question. The second part of the question is: should we documenters produce a stereo if our recording is in mono, should we create a stereo version of that file? Is that what you are asking? That's right. From the archive's point of view, if your original is a mono, then that's your original; that's
what you deposit. If you produce a stereo, it's a derivative; it's not your original. Now, the reason why, as Tom had mentioned, you might want to create one (so you will keep your original, but you might want to create a stereo), I think what he had mentioned, was just for the sake of your listeners, the people who listen to that recording. Of course it won't carry stereo information, but it might be more pleasant to listen to. Yes, on two ears. If there's a mix of recordings in a derived product, for example a talking dictionary, and some were stereo and some were mono, that would drive your listeners crazy. So what you might do, if you want to use the mono ones, is to do the duplication, and then everything will be stereo: some would be original stereo and some would be derived stereo, but for listenability in such a context. That answers the question. But from the archivist's point of view, if your original is a mono, we want the mono, if you have it, because some people destroy it. Any other? Yeah. No, no, please. Is it lossy, or some kind of lossless? Well, WAV is lossless. There's also FLAC. Yes, there is. I am not sure about the recommendation for FLAC. I know that it's lossless but occupies less space, but the issue with FLAC, as far as I understand, is that not everybody perceives it as being as good quality as WAV files. I will need to look into that to give you proper advice; I don't know the answer to that, but from my very little knowledge about audio, that's what I know. Perhaps that would be a question we can ask Tom, because he's the one who knows a lot more about it than myself. But what I know for sure is that when it comes to archiving, FLAC doesn't have the same status as WAV files. WAV is taken by everybody to be the best, but FLAC is fairly recent; it's an open format, but it hasn't received as much appreciation as WAV for that reason. How much time do I have? Do you
normally get a break in between? Do you want to have a little break? Would it be okay if I go over the audio slide and then we can break for 5 or 10 minutes and then you come back? Hopefully. Any more questions about that? So, video. I don't have recommendations on video; no one in the field has recommendations on video. It's not very stable yet: advice about what is the best preservation-friendly format changes every year as new formats come in. Text is very stable: it's plain text, .txt, tab-delimited or comma-separated files, or PDF/A-1 (PDF/A-1b or PDF/A-1a) if it's documents that have formatting information that you need to preserve. As I said, annotation files will depend on the software, provided the output is readable by an editor. So ELAN, Toolbox, FLEx, what else are you using, Transcriber: they all output or can export XML. The formats in which they save their files are already preservation-friendly, because you can open them with an editor to get the content. Images: JPEG and TIFF. TIFF is the best preservation-friendly format, but it's the one that occupies more space, so it's good for scans, scanning manuscripts and old books. JPEG is usually the format in which we get images from digital cameras. If you're scanning, excuse me, if you're scanning, make sure your resolution is 300 dpi or better. If you want to scan something and you don't know what I'm talking about, come to the archive and ask us. Some people have scanned field notebooks that they've handwritten and they've done it at the lowest resolution, and they're not very good; they're useless in the sense that we could have had the same information at much better scanning quality, especially with the scanning equipment we have nowadays. Structured data, spreadsheets, databases: we encourage people to give it to us in the format in which they created it, so always keep the original. But normally when we get it, we export it, so the tables, if it's a database, into CSV or a tab-delimited format. Spreadsheets, if
they've got colours and other information that cannot be preserved otherwise, we also create PDFs in addition to the CSVs. So again, we talk to depositors, and of course when the time comes and you need to archive with us or another archive, it will be good to get into conversation with them to see what the recommendation would be. So I will stop now. Let's get started. Okay, Peter explained to me that after this lecture you're going to be working in more detail on some actual data, recordings that you've made. So I would like you to interrupt me from now on, because all of the things I'm going to talk about are practical things. Interrupt me and ask questions, thinking about what you will be doing in the next few hours before the film night. Okay, so looking at audio files. We've talked about these settings; they should be settings that you set on your recorder. If for some reason you've made a mistake, keep the original, do not destroy the original. Then you might want to change it from .mp3 to a .wav file. You shouldn't really, but let's say you decide for some reason to do so; always keep your original regardless of the state of it. Remember your audio training, go back to your notes, go back to your experiences and see what were the things that you listened for. Yep? I don't know. Could somebody make a note, and let's ask Tom. Yes, so it would be higher resolution, but he wouldn't be able to hear the difference, right. Originals and working formats. Once you have your original and you make a copy and start working on producing your working format, your derivative format, feel free to edit, trim and join as appropriate.
If there is a silence in the beginning where you were trying to sort out how the machine was working, how the recorder was working, if there are laughs or coughs or a discussion you don't want included because it's not part of what you would like to include, feel free to trim it out. Keep your original though, because sometimes maybe that discussion in the beginning is useful, maybe it becomes data in the future. That's why keeping your original is very important. Be selective, so select the bits that you're interested in working on for some purpose, and of course think of the potential of this audio file being used by the community and other researchers. We were working on a large collection recently that was data on nasalization: very specific, very detailed, very focused. Produce that as a derivative, but keep the originals just in case somebody else would like to look at another aspect of your recordings. Do not apply any filters, effects or noise reduction to the original audio. The same goes for archiving: the copy we would like in the archive, we would like it to be without any of these filters. You might want to apply them if you want to do something with the files, so for example you might want to be able to transcribe them, in which case you might want to enhance them or slow them down or do other work on them. That's why it's worth keeping multiple versions. Of course, multiple versions come with a challenge of their own, how to manage the multiple versions, and that's another discussion that perhaps we could have. If you made a mistake, or if the recorder that you used recorded MP3, do not destroy the MP3s. Some people convert them to WAV files to hide their mistake and destroy the MP3s. Don't do that. Give the MP3s to the archive, it's fine. We'd rather have the original than a badly converted or manipulated file, if you see what I mean.
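As a small illustration of checking a recording against the settings mentioned in the lecture (44.1 kHz, 16 bit, stereo), here is a minimal Python sketch using the standard-library `wave` module. The function names and the `RECOMMENDED` table are made up for this example and are not part of any archive's tooling. It only reads the file header and reports differences; it never modifies or converts the original.

```python
import wave

# Settings mentioned in the lecture; treat them as a default, not a rule
# (a mono original stays mono).
RECOMMENDED = {"sample_rate_hz": 44100, "bits": 16, "channels": 2}


def describe_wav(path):
    """Read the header of a WAV file and return its basic settings."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bits": wav.getsampwidth() * 8,  # sample width is in bytes
            "channels": wav.getnchannels(),
        }


def differences(settings, recommended=RECOMMENDED):
    """Return {setting: (found, recommended)} for every mismatch."""
    return {
        key: (settings[key], wanted)
        for key, wanted in recommended.items()
        if settings[key] != wanted
    }
```

A reported difference is something to note in your metadata, not a reason to convert the original: `differences(describe_wav("201.wav"))` on a mono recording simply tells you that the file has one channel.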
Of course, for any decision you make when it comes to converting or trimming or joining data, always make a note of what it was before, how it ended up, and what the process was. So you had a file called 1 and a file called 2 and you joined them and you renamed the result kakiasi. Make a record of that so that you remember that the files corresponding to your kakiasi file were files 1 and 2. It's for your sanity in the future: you do it once and then it's a record, it stays with you for life. And that's a recommendation from the archive too: if you deposit digitized material, include information on how this was done. I will at some point find out the best angle for that microphone so that I can hear myself. But it's fine. Any questions with regard to the audio files? Are you all happy? So we need to ask Tom about the higher frequency and we need to ask him about the FLAC files. So make a note and let's ask him. Preparing your video files. Now, I said that for audio files things were pretty stable; people very much agree on what formats are the best. For video files, it's still something that is under discussion and debate. So two years ago, the best recommendation was MPEG-2. Now it's no longer a recommendation. People don't like it; they go for different ones, and they change. So that's why we asked the Joint Information Systems Committee, otherwise known as JISC, maybe you know it, which is the biggest archiving organization in the UK. Their recommendation was: keep your originals, whatever these originals are, whether they are .MTS, .MPEG or .MOV, so the native original that your camera records in, and then produce converted versions of that in different formats depending on the best practice every year, let's say. So you see the importance of keeping the original, because the easiest way to migrate would be not from a derivative but from the actual original file.
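One possible way of keeping the record described above (what a file was before, what it became, and how) is a plain CSV log that sits next to your data. The Python sketch below is only an illustration: the column names, the `log_change` function and the file names in the usage note are assumptions, not a format any archive prescribes.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical column layout for the change log.
FIELDS = ["date", "action", "sources", "result", "tool", "note"]


def log_change(log_path, action, sources, result, tool="", note="", when=None):
    """Append one row describing a rename, trim, join or conversion."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # header only when the log is first created
        writer.writerow({
            "date": (when or date.today()).isoformat(),
            "action": action,
            "sources": ";".join(sources),  # several originals can feed one result
            "result": result,
            "tool": tool,
            "note": note,
        })
```

For example, `log_change("changes.csv", "join", ["1.wav", "2.wav"], "joined.wav", tool="Audacity")` records that two originals were joined into one working file. Because the log is a UTF-8 CSV, it is itself in a preservation-friendly format and can be deposited with the data.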
So you need to make sure that you archive the most appropriate quality for what you collect. The difficult thing about this is that you need to make a judgment as to what is the appropriate quality and what is the genre. So it depends on what you're shooting, really. So is the video a film about a landscape? Yes. And there are some people doing something in the middle, perhaps. Now, would you need very high definition or very high detail for that? This is a question I can't answer. It will depend on what you want to do with it in the future, so you have to think about that and judge based on it. Format. As I said, whatever the source, MPEG-4, H.264 or MTS, keep the original and have a working format. So for example, if you want to annotate a video file in ELAN, it won't like very large files. So there are limitations when it comes to software. What you would do is keep your original, create a derivative working format at a compressed quality, and annotate that. Then I'm going to talk about what we recommend you do when it comes to these products and archiving. Again, as with audio files, edit, trim and join as appropriate, and be selective. Think of the things that are useful. Do not include shots of the ground unless it's necessary. So suppose there was a mistake in the beginning and you do this with the camera and then you do that: do not include the shot of the soil unless it's necessary for some reason, unless you were actually shooting the grass. And do not forget to record metadata about aspects of the video, such as what equipment you used. Was it on a tripod? Was somebody holding it? Who was holding it? Codecs: what was the original file codec? What was the working format codec you derived? What software did you use to convert it? It's fairly important with video files because not all software converts equally well.
And of course the relationship with other video files and other audio files, just in case you extracted the audio file from that video, and with other annotations. So far so good. Any questions? Are you dealing with video files in what comes after this lecture? Just audio files. Next time you're going to do so. Now, this is a very interesting question that we've had at the archive, one that we have had to make a decision or have a policy on. More and more people use video cameras, HD cameras, and we get very high quality videos. And there are issues now about space and how many different copies of the same video file you would keep. So the policy, or the recommendation, we are currently giving out is: we would like your original file, and we would also like any derivative video files that you worked on by creating an annotation. So you've annotated a video file, an MPEG or an MP4, using ELAN, and you've time-aligned your annotation to that particular file. We want that too. So you'll have your original, and then within your working formats you will have other versions, not originals, that will be your master copies, let's say, for different uses. Suppose you want to create a film, because you're a filmmaker and you would like to show a high-definition film to the community; then there might be another master copy of your original that you have edited, trimmed, selected, but that is of higher quality and in a different format than the one you put into ELAN to annotate. Does that make sense? Right. I mean, of course it's all theoretical now, unless you actually do some work and get your hands dirty. But just keep it in mind: next time you work on video files, these might be issues that you'll come across. So we say, as I said, the archive object would be the working file most closely associated with the research goal or method. So you'll have a master copy of that working file.
If you were sending us data, us meaning ELAR, we would rather have them as files than DVDs, because some people might burn DVDs to give to the community. It's fairly hard to extract menus and everything from that structure, unless you produce a huge ISO that people can download and burn. So we'd rather have your audio, sorry, your video files, your digital files. And as I said, include as much information as you can. The important thing is that you record that information at the point when you create it, if you can, because otherwise you'll forget. Don't worry about the last point. It applies, again, if and when you deposit something with our particular archive. And of course, depending on who or where you give your data, there will be different requirements. So it might be best to check with the individual archive you would like to give your data to, to see what limitations or restrictions apply. I think I'm done. Any questions on the video files? How many of you have worked on annotating an audio file? What software have you used? ELAN? Anything else? Any phonologists or phoneticians? Praat? Transcriber? Toolbox? FLEx? Not Toolbox, yeah. FLEx? Not yet. Next term. Sorry, spoilers. So with most of the ones we mentioned, and perhaps some others that I've missed, you can either export to an XML or text format that's good for preservation via the export menu item. The rest are .eaf files or .tip files for the tool. Not for the Toolbox. So Toolbox and Transcriber .trs files and Praat files, pretty much. The Praat files nowadays, not the old version ones. They are pretty good as they are, so you don't have to do any work on them. So just to give you an idea, people might give us their audio files first, and then they work on the transcriptions, and as they work on the transcriptions they give us the data. So it's not unheard of for people to deposit their data in installments.
And we encourage people, and I think it's a good recommendation for you as well while working on your data, for the audio or video files that you haven't yet annotated, to sit down and write a short description of what is being talked about. It will be useful for other people and it will be useful for yourselves to know what information is included in that file without actually having to listen to it every time, even though you haven't worked on annotating it yet. And I've given you the link to the advice that we normally give depositors. It's all part of that section on the website that I showed you in the beginning. Any questions about annotation files? You're not going to work on any annotation files after this, just organising your WAV files, okay. I won't ask any more. Text files: very, very standard and stable at the moment. The archive's preference for preservation is plain text files. UTF-8 encoding is very important, especially when it comes to multilingual content: you need to make sure that when you open the file on different systems you can still read it and the characters do not break. Of course, ELAN or Transcriber HTML files are also types of output that we can read quite easily. So you can have text files in HTML format. Avoid Word documents if you can, or use Word as your working format, so you can type in Word, but when you create the preservation copy that you want to share with other people, make sure you create either a .txt version or, if formatting is important, a .pdf version. Again, make sure that the encoding is the right encoding. If you intend to save your text documents as HTML, do not export from Word, because when you open the result in an HTML editor it has loads of junk in it that you don't really want to have when you're looking at it as an object that you archive. So best to copy and paste, or save as .txt, or save as .pdf.
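To make the point about encoding concrete, here is a minimal Python sketch of one way to check whether a text file is valid UTF-8 before you treat it as a preservation copy. The function name `utf8_problem` is made up for this example. A file that passes is readable as UTF-8; a file that fails was probably saved in some other encoding and needs to be re-saved from the program that created it.

```python
from pathlib import Path


def utf8_problem(path):
    """Return None if the file decodes as UTF-8, else a short description."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not valid UTF-8.
        return f"invalid byte at offset {err.start}"
    return None
```

Note that plain ASCII files also pass this check, since ASCII is a subset of UTF-8, so a pass does not tell you that special characters were typed correctly; it only tells you that the bytes can be read back consistently on another system.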
I think if you have Word 2010, it saves as .pdf if you select that in the settings, so it's fairly straightforward and standard nowadays. Questions? Images. Again, make sure you include your best pictures that show something meaningful, and not just a dump of your digital camera. So there are approaches to selecting the ones that are relevant and the ones that are good. Make sure, of course, as with all the rest of the data, that you include metadata with them and that any relationships to other sessions or other files are explicit as well. So of course, yes. The last point is that perhaps a picture is worth a thousand words, and a picture is also worth a thousand minutes of video, in the sense that it might be easier to demonstrate something by taking a picture of it, a snapshot, rather than videoing. Of course it will depend on the circumstances and on what you want to depict, but do not undervalue the importance of taking a nice picture, or even of a diagram in some fields. So plants, animals: maybe it's not the particular animal or the particular plant you want a picture of, it's the prototypical image of that plant. So maybe diagrams are a better way of showing what you mean. And of course anything you have, maps, scans, images, consider archiving them if you have the right permission to do so, which gets me very nicely to... I've talked about that. Oh yes, on a technical note: use optical, not digital, zoom. Digital zoom is much better than it used to be, but it's still best to use your maximum optical zoom and then crop the picture or do something else with it, rather than use the digital zoom and pixelate it. And here's another question that we have recently got, with respect to including scans of manuscripts, or scans of unpublished or out-of-print books that people have found or come across on their journeys.
So if that happens to you, then there are certain things you need to be aware of. So photos or scans, bless you, photos or scans of unpublished or published material, manuscripts, and photos of participants are all very useful and very important, if you have the right permissions to deposit them with an archive. If it is a published book that you're digitising, make sure you have permission from the publisher. If it is a manuscript, make sure you have the right permissions from the author. If it's pictures of people that you want to archive, make sure you have the right permissions for that. Make them aware of how you're going to use it, because the archive won't be able to accept it, even though it's very, very valuable and it might be the only instance of that book, unless the copyright is clear and that information is available. Excuse me for a second. So I've included some links that you might find useful in your research about different copyright laws and what constitutes intellectual property in different countries, because laws are different. If you happen to come across something like that and have to make a decision, consult with your archive. Nowadays, if you're in a UK university, there is usually a service, a person who's responsible for information, the information compliance person. So check with them if you are at SOAS; if you are elsewhere, check with the equivalent where you are. And here is a note, a kind of hint or tip to all of you. Go into the menus of all the software you will be using and explore the export and import options, and see how much you can export and how many of these programs allow you to export to preservation-friendly formats. And something else I'm going to talk a little bit about is how easy it is to set up backups. Some software will make a backup of your latest changes.
So ELAN will create a .eaf.001 file, which is the backup of your previous change. So explore all these different possibilities and options, because it's good to know that they are there. I have got some notes about character encoding but I don't have time to go into it. So if you have any questions, ask Peter, me or David. It's just information about what it is all about, some ways of checking, and some software to use to check whether something is UTF-8 or not. You know about all these, hopefully, right? If there's one picture from this session that I want you to remember about metadata, it is this. What could you tell me about this picture? Have I shown this to you before? Good. Right. Does anyone have any idea what this could be? It's data. Hold on. I'll try. I don't know. Let me try and get rid of this. So it's a sequence. You have all the zeros on the first line, then all the ones, the twos. We're trying to see if we can make it bigger. It's a punch card. Yes. Yes. Well done. It is a newer version of punch card, because there were older versions of punch cards with bytes, zero and one, zero and one. But that's a newer version. It's much like if you've ever taken a computerized multiple choice test where you do A, B, C, D or one, two, three, four; it's very similar in the sense that there you black out the answer, and in this case you punch it. So this is data. These are somebody's questionnaire answers. So somebody was given a questionnaire with seven, with answers from the range zero to nine, is it? Yeah. So you can see the number of questions here and what people put as their answer. So what are they? Okay, I'll show you the next picture and then perhaps we can consider the issues together. So that's this. That's the other one. What could that be? It's a tape. Yeah? Maybe. About a Wednesday. About somebody called Wednesday. Very good. Wonderful. And the last one, that's a more recent example. So these are obviously file names, right?
What can you tell me about these four files? It looks like my assignment files as well. And loads of my presentations, actually. Final draft, final final. So that was too close, sorry. That brings you to the question of versioning and how you will need to manage your versions, especially when you're working in a group, right? So what are the things that would have made our life much easier? What's the missing information? What's the missing link? What would have helped us understand what this data is all about? Metadata. I'm grateful that Nadja knows everything, because no one else says anything. It's not aimed at you, Nadja. It's kind of "please help me". No, that's all right. No, it wasn't, it was the complete opposite. It wasn't meant to be. Thank you. Thank you for answering. So of course metadata, and it also brings up questions about labeling and how clear our labeling is, and the importance of semantic labeling, perhaps, or the unimportance of semantic labeling. Another issue is punch cards. Do you think we would be able to process this data nowadays? When you said this is a new version, how? Oh, the older ones were zeros and ones, and that's how you would do bytes. When you say it's a new version, what year is it exactly? I don't know. I think it should be 60s or 70s. So it's not new, but I've seen older versions of that. So we can't read it exactly: we need a key, we need to know what the questions were. We also might need some machinery to be able to read or go through that. Same goes for the tape. That's why I've brought the old tape players here, tape machines, so that we can play some old media. But to me, those two aside, this is an unacceptable way of doing your file naming. Unacceptable in the sense that it will drive you crazy, because if you need to create a fifth version of that, how are you going to name it? Final one.
Real version: which one is the real, which one is the non-real version, which one is the fictional? So for versioning, I think, link something concrete to that file, such as a date or a time, or the initials of the person who last edited it; these are different ways of doing versioning on files. Folders can also help you put things in a structure, or you can even have a spreadsheet and record that information there. So you know all about that and about what endangered language documentation should include, so I won't go into it. Two more pictures. One is this. I've got ten minutes, or do I have to stop now? I've got five minutes. Okay, I'll go through my pictures very quickly. So that is "I'm sorry, but we had to blow up your laptop", and it highlights the importance of having loads of backups in loads of places. This is an actual example from field work. Not linguistic field work, but other field work. So that's the source. You can go and have a look at her blog and see what it was all about. Other examples. You have your data in a data center or on your university's computing servers. Something happens, they catch fire, and that's your only copy. Now you're hoping that they've got a preservation plan and backups, that they back up the servers, but you should never rely on one location only. You should always make multiple copies of your data and store them somewhere else. I've got some points about risk of loss: think about whether your data will survive a disaster, and protect it not simply against fires or bullets, but also against damage such as your USB stick falling on the floor and breaking, or being washed with your trousers in your washing machine. Simple things like that, which can happen to anyone, rather than the scenarios I just offered you. Of course, as I said, it is important that you have a strategy rather than randomly backing up things in random places. It is very important that you agree on and document what you're going to do.
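As one illustration of a backup routine that is a strategy rather than random copying, the Python sketch below copies a file to several locations and checks each copy against the original with a checksum, so a damaged copy is noticed at once. The function names and folder layout are assumptions made for this example; the checksum algorithm (SHA-256) is a common choice, not something the lecture prescribes.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dirs):
    """Copy source into every backup directory and verify each copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)  # copy2 also keeps timestamps
        if sha256_of(target) != expected:
            raise OSError(f"backup copy does not match original: {target}")
        copies.append(target)
    return expected, copies
```

If you write the returned checksum down next to the file name, you can recompute it later on any copy to find out whether the medium has silently corrupted the file.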
Here are some issues to think about. Nowadays it doesn't have to be a very expensive process. It doesn't even have to be in the cloud. It could simply be an offline hard drive that you buy and give to your mom to put somewhere safe, or give to somebody else in your family, and have another one here or another one elsewhere. Another problem with data is when your media become corrupt, so you need to be aware of that as well. It's one of the advantages of having multiple media and multiple locations. Just to make you aware, CDs and DVDs are the ones that deteriorate, degrading quite easily compared to SD cards, flash drives, hard drives, etc. I'll leave this up to you, and here's another picture. That's my last picture, I think. Here is another one. Somebody once said that endangered language documentation is very much like medical data sometimes. Some of the data that you record are rituals that might be sacred, rituals that only particular people can see. So it's up to you to make sure that you have the right things in place when it comes to protecting your data from unauthorized access, and to make sure that your computer cannot be accessed by everybody at any point. So you leave your computer lying in the library and somebody steals your laptop; if that goes, all your data goes. So have you set up a password? No? Yes? So these are things to consider and think about early on while working with your data. And if you need encryption, come and find me at the archive. If you don't know what encryption is, again, come and find me at the archive. It's a way to store your data so that not everybody can immediately look at it. It's a way to hide or scramble your data so that somebody cannot straightforwardly open it unless they have the key or the password. And I think I'll finish on that note. Sorry, that's the last picture. So I'll finish on that note. Are there any questions? Sorry.
I found that these pictures are always a good way for you to remember things rather than because bullet points get lost. Yes, yes. Thank you very much.