Good afternoon. I'm going to talk to you today about the digital outputs of collaborative humanities research projects. So I'm not going to talk about my own research, but rather give you an overview of some of my experiences working on research projects. To give you a little background about myself: I wrote a PhD on the rock-cut monasteries of the Western Ghats in India, and a GIS database formed a central plank of that thesis. The appendix of the thesis included many pages of printouts of spreadsheet files which are completely unusable today, so I've learned the hard way about data management. I've also worked as a software developer and as a database administrator, and I now work for the Beyond Boundaries project at the British Library as GIS research curator.

Throughout the presentation I'm also going to discuss a previous project I worked on, Mapping the Jewish Communities of the Byzantine Empire, also ERC funded. We did some things well on that project, and some things quite badly as well, I'm happy to admit. All in all, we delivered the digital outputs on time (they were two weeks late in the end, which is not bad as these things go), the primary one being an online web mapping system, and I'll refer back to it a couple of times throughout the rest of the talk.

Digital outputs of collaborative humanities research projects can be divided into three types, I would argue: data, resources and tools. I've arranged them in this pyramid-like fashion because, as we move up the levels of the pyramid, we have increasing technological complexity and increasing effort involved, with a potential payback in usability and the ability to change research behaviour.
That also comes with risks as we move up: firstly, that the digital output will not be used at all, and secondly, the declining longevity of these digital outputs as we move up the pyramid. So I would argue that any project needs to build a strong foundation at the base before moving up, and too many digital projects have gone straight in at the level of tools without thinking through the other aspects carefully. Now you'll have to forgive me for this one, but we're not working in ancient Egypt, we're working in India, so perhaps a shikhara (this is the Kailasa temple at Ellora) is a better analogy than a pyramid.

Anyway, let's begin at the lowest level: data. Everyone on the Beyond Boundaries project is generating data, and you'll be seeing their presentations throughout the day today. These data are going to be deposited in a raw form for download, hopefully in a more organised fashion than this. There are plenty of different repositories available for depositing data, but our project uses Zenodo. Why are we depositing these data on Zenodo? Well, the first reason is that you have to, to comply with grant rules and our obligations to the ERC. But I would go further than that and say there are very many good reasons for doing so too. Firstly, there are all the reasons associated with open data generally; justifying the taxpayers' investment is an important one, and if the data are on Zenodo, they are not going to be lost. Secondly, I think there are some particularly important reasons for our project: we should allow researchers in developing countries to see our work. We're dealing with Asia, and given the project's aims of helping scholars break down regional historiographies and the boundaries between bodies of work, open data is important. Thirdly, I think it benefits the researchers themselves, because you can come back to your data in the future and use it. How many of you have had to fish through an old hard drive to find a data set you used six or seven years ago and found it totally unusable? It takes you days to work out what's going on, and you can't recreate the analysis or digital outputs you made at the time. If you clean your data properly at this stage and deposit it somewhere you can find it again, it's going to be good for you. The process of data cleaning and deposition also provides new insights: if you have to know your data well enough that it's perfectly clean and someone else can use it, you're going to generate new insights from those data. And it makes the findings of your publications more believable.

So Zenodo is a great platform as far as I'm concerned. There are limitations, and various people have discussed some of those today, but flexibility is one of the key benefits of using Zenodo, though it also creates a lot of potential for problems. The power of Zenodo is cumulative: the more data that goes on there, and the better keywords are used across Zenodo, the greater the ability to link data sets together. A final benefit I'm going to discuss is citation. When you deposit your data on Zenodo, cite your data in your publications. I would argue that is an excellent way to structure the data you put on there, and it allows people to find your work as well. People often tend to cite print publications; I do this myself: if a digital publication has a print counterpart, I often just go and find a reference to the print publication in a library catalogue, even if I'm using the digital version, because that's what I'm used to doing. But Zenodo does make it easy to cite: the citations have a standard structure, and you can download a citation in various different formats on the platform as well.

So, as I said, flexibility has its own problems, and we need to make the data we're depositing useful and usable; there's no point in doing it otherwise. I've got some tips here, primarily for the project researchers, but the rest of you can do this too. Deposit the data in the most basic formats you can find, to avoid obsolescence. For textual data (I'm not going to comment on all these formats), PDF/A is an archival PDF format. I would not recommend depositing in PDF at all, but if you have to, make sure you use PDF/A; it's easy in Acrobat (go to Save As and there is a PDF/A option). It means the fonts and any other files associated with your PDF are embedded within it, so it's not reliant on other files on the computer. The other point was: don't deposit in JPEG. I think Robert mentioned that earlier; it's a lossy format, which means it loses information when saved. Now let me draw your attention to the TXT format. Most of you are working with text and will be saving it, and you need to save in Unicode, that is, UTF-8. It's a massive pain that Excel doesn't export UTF-8 by default, so every time you're exporting a CSV (comma-separated values) file from Excel, you have to use Save As and choose UTF-8, otherwise you're going to lose all your lovely diacritics, and I know you Sanskritists love those and would be very upset without them. And if you take one thing away from this presentation: do not upload to Zenodo in DOC or XLS format. In the Beyond Boundaries community folders there are quite a lot of DOC and XLS files; deposit those too if you want, I suppose, but these basic formats have to be the priority. So what other tips have I got? Structure and clean your data.
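On the UTF-8 point, here is a minimal sketch in Python of the round trip you want your diacritics to survive; the file name and column values are illustrative examples, not project data:

```python
# Minimal sketch: writing and re-reading a CSV in UTF-8 so that
# diacritics (e.g. IAST transliteration) survive the round trip.
# The file name and rows below are hypothetical examples.
import csv

rows = [
    {"site": "Ellorā", "term": "śikhara"},  # IAST diacritics
    {"site": "Ajaṇṭā", "term": "caitya"},
]

# encoding="utf-8" is the crucial part; a legacy code page would
# silently mangle these characters on export.
with open("sites.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "term"])
    writer.writeheader()
    writer.writerows(rows)

# Reading back with the same encoding restores the data intact.
with open("sites.csv", newline="", encoding="utf-8") as f:
    restored = list(csv.DictReader(f))
```

If `restored` comes back identical to `rows`, the diacritics have survived; with the wrong encoding on either side, they would not.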
Your data should be consistent: we don't want multiple values in the same column, inconsistent capitalisation, a lack of standardised terms, or unexplained abbreviations. Use headings that are meaningful, and meaningful to other people, with acronyms explained in the metadata. Importantly, look at the way other people have deposited their data and call things the same as they have done; don't have everyone using different words for what is essentially the same entity. Keywords are a very important part of Zenodo. You're going to make these up, we all do, but also look at what keywords other people are using. Zenodo searches keywords including diacritics, so be careful about that, and if you're stuck, the Library of Congress catalogue has a list of canonical subject headings that you can use as keywords.

Now we come to file names. These are some of my favourite horrendous file names. A friend of mine actually did send me a photo with that name, but no, there's no place for humour in digital outputs. I'm not going to go through all of these, but I'm sure you're familiar with having received files from colleagues with some of these names as well. This one is one of Daniel's file names, actually, which I was very pleased to receive: within the context of the repository it makes sense, as it's the ID for the inscription on our Siddham platform. So do change the file names, because otherwise every person who downloads your data set (and I know there will be thousands) has to rename the files. On JSTOR, for example, you have to rename the file every time you download it; they don't give it to you in a nice format, they use their own ID.

Right, licensing. Zenodo applies a Creative Commons Attribution (CC BY) licence by default. That's fine for me, but some of you may not want your materials used by a commercial company, and so you may perhaps want to change that to
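To make the cleaning advice concrete, here is a small sketch in Python; the abbreviation table and the semicolon convention for multi-valued cells are invented for illustration, not taken from the project's data:

```python
# Hypothetical cleaning sketch: trim whitespace, standardise
# capitalisation, expand ad-hoc abbreviations, and split
# multi-valued cells so each value can live in its own row.
ABBREVIATIONS = {"insc.": "inscription", "mon.": "monastery"}

def clean_value(value: str) -> str:
    # Normalise one value: strip, lowercase, expand known abbreviations.
    value = value.strip().lower()
    return ABBREVIATIONS.get(value, value)

def clean_cell(cell: str) -> list[str]:
    # One cell may hold several semicolon-separated values;
    # return them as a list so each can become a separate row.
    return [clean_value(v) for v in cell.split(";") if v.strip()]

raw = "  Insc. ; MON. ; stupa"
cleaned = clean_cell(raw)
# cleaned == ["inscription", "monastery", "stupa"]
```

The point is not this particular code but the discipline: one value per column, one spelling per concept, abbreviations documented in one place.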
prevent commercial use. That's something you should think about too. And finally, the breakdown of the data, which is really something for discussion: do you upload all your data as a single zip file, as we discussed this morning, in a single repository with all the information there, or do you break it down into individual sections, so that each individual piece of data can be cited separately? That's something the project needs to discuss further.

Right, so that's the data. Once we've got that base of the pyramid in place, I think we can move on to resources. Many of you contribute to digital resources, and I'm talking about specialist academic resources like GRETIL or SARIT or the Archaeology Data Service, but also broader resources like Flickr and Instagram; people in the project may want to think about publishing in those areas too. It's quite gratifying to see your images reused in people's blogs and elsewhere. I'm defining resources as domain specific, in that they deal with a single type of data, often with a specific period, geographical area or language, and they're usually presented on the web in a format that's understandable and accessible: you don't need to download a file to look at those data. That's how I would define a resource, though we could argue about that.

Why deposit your data in resources? There's a community already in place; they have a larger and broader audience than if you deposited your data separately; they're often more accessible, for example optimised for search engines; they can take advantage of added-value functionality such as display on a map; and they often give you a canonical URL for citation, which may be more useful in a publication than a Zenodo DOI, because you can take all your different pieces of data from the same place. But choose wisely when selecting resources: there's often a lot of effort involved in structuring your data to meet the standards the resource requires. What if the resource goes down? How long will it be available? So read the terms and conditions, and note that some of these resources you have to pay for as well.

We have our very own Beyond Boundaries resource, Siddham (siddham.uk). It has recently gone live but is still under development, and I think Daniel will be talking about it in more detail tomorrow, so I won't go into too much depth. We're hoping scholars will contribute their inscriptions to Siddham, and over the remainder of the project we'll be trying to promote it and its use by further projects in the future. It deals with South Asian inscriptions, and inscriptions from Asia more widely too. So we'll come back to that.

I don't know how much time I've got left, but I was going to talk briefly about linked data, without going too far into the technical considerations. You may know of this as the semantic web: a way of exposing, sharing and connecting pieces of data, using structured data in a particular variant of XML or JSON. There are many ways to use linked data, but what I'm interested in for our project is connecting digital resources that have something in common, whether that's a place, a person, a time or another entity. By simply using an identifier that's common across these resources, it's possible for machines to harvest data and aggregate it in the kinds of portals we heard about this morning with numismatics. I would argue that every project needs to think about this, because we need a world wide web where people can get from resource to resource through a graph like this. If you have an interest in a particular place, you should be able to access data from all the resources about that place, not just that one resource.
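As a toy illustration of what a shared identifier buys you, here are two hypothetical record sets (the record IDs are invented; the Pleiades-style URI simply stands in for any gazetteer identifier) that a machine can aggregate purely by grouping on the common place URI:

```python
# Toy sketch of the linked-data idea: two separate resources
# (an inscription database and a coin database) share nothing but
# a gazetteer URI for a place, yet can be aggregated by grouping
# on that common identifier. All records here are hypothetical.
from collections import defaultdict

PLACE = "https://pleiades.stoa.org/places/579885"  # any gazetteer URI

inscriptions = [{"id": "insc-001", "place": PLACE}]
coins = [{"id": "coin-042", "place": PLACE}]

# Group every record, from every resource, by its place identifier.
by_place = defaultdict(list)
for record in inscriptions + coins:
    by_place[record["place"]].append(record["id"])

# by_place[PLACE] == ["insc-001", "coin-042"]
```

That trivial grouping step is, conceptually, what the aggregation portals do at scale: no shared schema is needed, only a shared identifier.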
Otherwise you're cut off. In South Asia there's Pandit, which is a gazetteer and a prosopography; I don't know quite how new it is, but it's certainly still in its infancy, I would say. Wouldn't it be great if, having an interest in a particular person or place, we could just click through and find data on different inscriptions, coins and texts, all about that entity? Linked data is really the only way we're going to achieve this, and any humanities project needs to be thinking about how it can contribute to this collective effort. So what does this mean for you on the project? Come and speak to me, and we can see if we can get your places into Pandit or another gazetteer, and what we can do to build links between your data and others'.

We're on to the final digital output now, which is tools. I'm defining a tool as something that facilitates analysis of a data set: a tool to interpret or manipulate data, often providing innovative results. There can be great variation in the types of manipulation or analysis these tools perform: visualisation, annotation, calculation. You might be able to feed in the data yourself, or the data set might exist within the tool. Examples of tools include READ, or the maps in the Jewish Communities of the Byzantine Empire project: I told you before that was a web GIS, a set of web maps for browsing data on those communities, and that's also a tool.

Now it's rant time, unfortunately. As someone who has built a few of these tools himself and witnessed a lack of uptake in their use, I would say it's very important in the humanities that we understand these tools take a lot of effort to produce, yet they have a finite life, due to the speed at which technological progress happens, and they really need to burn quite brightly in the period in which they're useful to make them worthwhile. Not enough effort has gone into production regarding their use, the target audience, the impact they might have, and how that might be measured. Web design in academia does not have the harsh feedback of profitability that exists in the private sector, so large projects can gain funding and continue to be produced for quite a long time before any feedback is sought whatsoever on the interface, on whether they're having an impact, or on how they might be reconfigured to do their job better. In essence, tools are often treated like publications, but it's not clear that they are objects of research in their own right.

If I have time, I'm going to talk briefly about one such tool, which is ORBIS. Now, I love ORBIS, I think it's great. It allows Roman communication costs to be estimated in terms of time and expense, and reveals the true shape of the Roman world. Basically, you can specify an origin and a destination across the Roman world, and find out the time it takes to go from one to the other. It's a very specific use case, and it's had a huge amount of money thrown at it from Stanford. It's a great system, very easy to use, and it's had a lot of publicity: there have been articles about it in the Telegraph, Forbes, The Hindu. It's an impressive interface, there's a lot of scaffolding around it, and they held workshops on using the tool, so it's been embedded within some practice. But if we look at the 150 citations for this tool on Google Scholar (the tool is now six years old, so that's not bad), virtually none of them use the tool for what it was intended for, which is measuring distance. People say, oh, this is a lovely tool; people say, oh, I'm not happy with the GIS that's underpinning this, or they've not taken account of this problem or that problem; but there are very few citations that actually use it to measure distances between places and improve the research as a result. That's an indication of how difficult it can be to embed a digital research tool in scholarly practice.

So, to sum up. Projects have high-level goals and eye-catching problems to solve, and many have ambitious aims, which helps them to gain funding. Mapping the Jewish Communities was supposed to help medieval historians take up GIS: that's a big ask. Beyond Boundaries is supposed to address regionalism in historiography, a lack of interdisciplinary research, the national orientation of research, these kinds of questions, which are very admirable aims. But can the digital outputs help to address them? I think the data and resources can: by depositing our data in resources, they can help meet these needs. Theoretically, digital tools can also help meet these high-level goals, but the successful uptake of these tools is difficult and has to be thought about carefully. And we should not forget that for each of these tools there can be many traditional publications, and those themselves can move a paradigm forward, address these questions, and provide examples for future research. So these digital outputs do need to be thought about carefully. Thanks.

I'd like to hear your opinion on the relative merits of early and late adoption. I think the general experience in mathematics has been that the most successful projects have been those which adapt tools that are well established and well entrenched, rather than developing new tools for tasks. So I'm just curious about that distinction: developing new tools versus waiting until the tools are well established in other fields before bringing them in.
Yeah, I think that if you have a tool that's established in disciplines close to the one you're working in, and it has been used well there, then there are practices in place: there are methods of transferring skills between disciplines, there's a familiarity, and, apart from all that, there is also a proven use case for the tool, which would suggest there's a more receptive market for it. So that's a good point. Take GIS in archaeology and history, for example: GIS was picked up early in archaeology, and slightly later in history, and in that early period historians went through a lot of effort to build bridges between the use of the tool in the two disciplines.

We train every year about 30 documenters: we bring them out to London and train them here, and we also train people in-country, so we go to Africa and China and train there. One of the things we always tell them is: if there's a really cool research project that has developed a tool for linguistic annotation, don't touch it. Do not go near it; don't be a beta tester, because when the funding runs out, the tool goes down and you don't get your data out. So this is one of the problems of short funding cycles: tools are created for a particular project, tailored to that project, and you don't know if they can be sustained afterwards, because funders don't fund the maintenance and further development of something; they fund new stuff, new stuff, new stuff. So there's a structural problem within the funding stream that doesn't allow for the further development of tools.
In linguistics, for example, one of the most bizarre situations is that the most stable tools, the lexical databases that linguists worldwide use, are tools developed by missionaries, by Christians, by SIL. People don't want to use them, but they're the most stable, the best-developed software out there, and there is nothing on the market that can be compared to them as open-source, non-proprietary alternatives. They have a constant funding stream, while everything else that was developed by others, like the Max Planck Institute in Nijmegen, or the DoBeS project, had a life cycle of three to six years and then went down with no further development, and likewise with tools from the CNRS. So it's one of the biggest problems with tools.

Yeah, the web moves forward very quickly, and unless you can build real momentum behind a tool, it's very difficult for it to have any longevity. Making the tool open source can help to some extent, but building an open-source community around a code base is very challenging unless the people working on it use that code on a day-to-day basis in their full-time employment and have to make changes in order to do their jobs more effectively. Otherwise, open-source code base projects generally fall flat, because, obviously, people will do things in their spare time, but to make major software upgrades it's a lot easier if people are using the code in a commercial setting, or not just a commercial setting but in their day-to-day work.

All of this said, I'm just curious whether you can think of any digital tool developed in a funded humanities research project that has been an unqualified success.

There's one actually that I've been using, called Recogito. It's for annotation, basically: you can annotate any type of text, but it's mainly being used by classicists, and the annotation produces the linked data I was talking about before. At the moment it works with place names: when you annotate a text and you see Rome, you link that to the URL for Rome in a gazetteer like Pleiades, and then that edition is published in the Recogito editions, which are made accessible, so people searching for Rome can find it. There's been a big uptake of that, actually; there are a lot of editions in the Recogito edition base, but whether that's gone through to the users, how often people are using it in their research as yet, I don't know.

I think there's another very obvious example. The Portable Antiquities Scheme is by far the most successful humanities digital tool project in the world, but it's supported by the British government at enormous cost, with enormous numbers of staff and institutional backing: all of the things we've literally just talked about. And it runs off technology which conceptually is very old, almost quite old even when it was put in place. So all of the things we've just discussed are present there. I was told recently that there are now over 100 PhDs partly dependent on Portable Antiquities Scheme data, so this is a clear example of a project that has succeeded.

It has led to research outcomes? Oh yes, enormous numbers of research outcomes.

And then, if I may, about JPEGs: we digitised two photo albums in the BL, of the Archaeological Survey of India, Burma Circle, and the BL kindly gave me JPEG and TIFF files for every image. I can't upload the TIFF files, they're too big; I let one upload to Zenodo for 20 hours and at some point the connection just dies, so I've uploaded the JPEGs, because I was able to.

That's fair enough. It's a good point: it's not always possible to upload these big formats.
I would say that in most cases, for natural pictures, so a photograph rather than line art or a colour diagram, a good-quality JPEG is not going to make a difference that matters. The issue with JPEG is generational loss: each time the JPEG is opened and re-saved, it loses a little bit of data. So if all you're doing is archiving an image, the risks are much smaller than if you're actually working on the images and re-saving them. Unless you're opening the file, retouching a little part, re-saving it as JPEG, and then doing that again ten times in succession, it isn't really going to matter.

Yeah, we'll let you away with a couple of JPEGs.

I have a user's question, and it's not a dig at Daniel's brilliant file name, but what was missing from the file name itself is any reference to a version number or something like that. Shouldn't that be an important part of the easily accessible information about the file, so that I can look at it and see where it fits? Who knows if he's gone back, fixed his file and re-uploaded it?

That sits inside the Zenodo versioning system: there is a version date available in the metadata. If you want to go back and change something on Zenodo, you can only upload a new version. When you upload something to Zenodo it gets a DOI and becomes fixed from that point on; you can't go back and change it. You can upload a new version on top of it, but you can't change the original. So I would argue that a Zenodo upload should not have a version number in the file name; that belongs in the metadata.

But if I have downloaded that XML file onto my computer, how am I going to know?

If you didn't take notice of the version number on Zenodo, that's your problem.

Wait, wait. Before you blame me: how am I supposed to take notice of it? Where am I supposed to store that information?

The place where you downloaded it from.

That means we're back to the same problem we talked about with JSTOR, which is that I now have to re-type a file name. So you want me to put the date I downloaded it into the file name?

Not all the metadata can fit in a file name.

I understand that, but what I'm saying is: where am I supposed to keep track of that information? Unless what I'm meant to do is not download the XML file at all, but go back and retrieve it every single time.

Unless you specifically want an earlier version, you would just always download the latest version from Zenodo, and that's that.

So I'm not going to have anything locally?

You may have a version of the file locally already, and then you go into Zenodo and, oh, there's a new version. It's a fair point you're making. You could make your local file name match the Zenodo version as well. Ideally, you could do things automatically: if you're relying on some set of files for a project of yours, and you want those to be up to date, you can have a script pull those files every night or once a week, and then you know you always have the most recent version.

I'm not arguing for a particular position; I'm just wondering how, starting from the best position, I could build something which is not going to need to be fixed later. Let's think about things like that: where should that information live?

This is partly asking the person who is filing their data to handle your own workflow problems.

What's most interesting about this discussion is that it also tells us something else. You put everything in Zenodo, but who knows how to use it? That's the thing: it takes quite an advanced kind of knowledge to know that you download it, you work with it, and you don't keep local copies.
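The "always download the latest version" workflow mentioned above could be scripted against Zenodo's public REST API. This is a hedged sketch: the records endpoint is real, but the exact response fields used here (a "files" list whose entries carry a "links"/"self" download URL) follow the legacy API and should be checked against the current Zenodo documentation before relying on them:

```python
# Hedged sketch: keeping local copies in sync with the latest version
# of a Zenodo record. The response fields assumed here ("files",
# "links"/"self") should be verified against current Zenodo API docs.
import json
from urllib.request import urlopen

def latest_file_urls(record: dict) -> list[str]:
    # Extract the download link for every file attached to the record.
    return [f["links"]["self"] for f in record.get("files", [])]

def fetch_record(record_id: str) -> dict:
    # One network call per sync; run nightly (e.g. from cron) if you
    # want local copies to track the latest Zenodo version.
    with urlopen(f"https://zenodo.org/api/records/{record_id}") as resp:
        return json.load(resp)

# Usage (hypothetical record ID):
# record = fetch_record("1234567")
# for url in latest_file_urls(record):
#     ...download url and overwrite the stale local copy...
```

The design point is that the version metadata stays on Zenodo, where it belongs, and the script, not the file name, guarantees freshness.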
You always work with the latest version, because that's the logic behind it. This is what we see with our linguists: they have never worked with the digital collections in our archive, because they're not used to them in teaching and in training, so they have a really hard time understanding how to create them. Zenodo is great, it's fantastic, but there's a job of training to do from the user perspective: how to use it, how to download from it. With everything sitting in Zenodo, one of the questions is: who are we serving? Because it's a very, very small community of users that is able to manage this kind of data set. So sharing with the developing countries is kind of a funny idea in that light.

Well, yes and no. I've been standing here for quite a long time now, so I should wrap this up. I would say Zenodo is a very simple platform to use at first sight; you can just download a file, it's not like reconstructing a relational database. I've worked with researchers from many parts of the world, and I think they're perfectly capable of downloading TXT files and viewing them, or opening a CSV file in a spreadsheet; it's not too difficult. Obviously not everybody can do that, but I think... yeah. Thank you.