I'm going to talk to you all about data management in my little project, which exists within the bigger project of Asia Beyond Boundaries. My project, Land Settlement and Society, is investigating a series of societal and cultural changes, largely the appearance of new forms of political economy and the spread of Hindu temple institutions, that took place during the reign of the Guptas in South Asia, so between the fourth and the sixth centuries CE. A lot of these changes are encapsulated in, and are often understood through, a series of copper plate charters that record royal grants of land, mainly to Brahmins, a practice that really started to spread throughout South Asia at this time. Lots of ideas and theories have been built on the basis of these inscriptions and contemporary texts, but they've rarely been tested archaeologically. So my project is focusing on one particular region, the region of Vidarbha in Central India, which was historically one of the first regions to adopt this practice of land grants. It's examining the archaeological context of those land grants to ground these inscriptions, to try and get at what was going on socially and culturally around them before, during and after they were used, and in doing so to test certain hypotheses about them and the effects that they are supposed to have had. Logistically, this project involves the work of 30 colleagues, including myself, who work in 11 different organisations across four different countries. What this project means in practice is that archaeologically we are looking at three main types of evidence: artefacts, the material culture, the stuff of archaeology if you will, the things people made and used; archaeological sites, the places where lots of different activities were concentrated, often categorised as settlement sites and religious sites; and then the landscape, both as the wider environment and as the space within which everything happened.
Examination of all of these subjects generates lots of different types of data. I'm not going to list all of those, but what's important is that there's a complexity to data in this project. It's not just one thing, it's lots of different things. There's not only a breadth to the range of data that's being incorporated in our analyses, but also to the sorts of information that they provide. We can, if we're thinking about data management, reduce all of that down to different categories of data. We have the basic components of the data, the physical objects: the material, the sites in which that material is found and the landscape in which it's all contained; and then the digital data that comes from, or is associated with, all of that, which itself can be categorised into documents and texts, databases and spreadsheets, raster imagery and vector imagery. I know that what we're all talking about is digital data, and that's by and large what I'll be concentrating on, but I just want to make the point here that in any sort of archaeological project we have to recognise that it is impossible to separate these two broad categories of data, the physical data and the digital data. They only have meaning in relation to each other, and both are carried right the way through the course of the project, through that data management process. Obviously each of these different types of data has to be managed in a different way. It's not quite as simple, though, as managing all of the texts in one way and all of the spreadsheets in another, because each type of data comprises different data pertaining to various subjects of inquiry and all of the different questions we might ask of them. And so they all have to be approached and interrogated and managed in various different ways.
So that means that a fairly complex data management plan has to be formulated, which governs every single aspect of the workflow, from the collection and production of the data to its analysis, its curation, its storage, how it's shared and ultimately how it's archived. Though of course there's considerable overlap in how each data set is managed at any part of that process. So rather than detail the data management plan for each and every single type of data that exists in this project, I'm just going to talk through each stage of that data management process and highlight some of the complexities that are involved. The first stage in the data management plan is data collection. Data in this project is both collected and produced. I make a distinction here between collection and production that's largely artificial; I use that separation just to distinguish two different stages in the workflow: between data that is collected and becomes instantly interrogatable and usable, and data that is then produced from other data sets through a variety of processes. In terms of that initial stage of data collection, data is collected through both desk-based research and fieldwork. Looking first of all at data collected through desk-based research, that's essentially background archival research, the research that has to happen before anything else can happen. So looking at sites and inscriptions that have been found, pulling together existing mapping resources and spatial imagery, all of which results in texts of inscriptions (in translation, I must admit), spreadsheets of data pertaining to sites and the artefacts in those sites, raster images and vector images. And those resulting data sets either become the subjects of analyses themselves, in which case they go straight to that next stage in the workflow, or they feed back into and inform certain decisions governing other areas of data that are collected through fieldwork.
Either way, much of this data is born digital at the point of its collection. It's collected in such a way as to enable its use by multiple participants in the project. So we all agree on particular file formats, which for the most part means Microsoft software for documents, spreadsheets and databases, and Adobe for illustrations. For raster imagery, we use TIFF and raw formats as standards. For vectors, we use various formats specific to the programs that use them. All of this is with a view to eventual archiving, to facilitate the ease of that archiving, though of course we recognise that many of the file formats that end up being archived will have to be changed. At the point of collection, data is collected according to predefined and commonly agreed templates, using predefined terminologies, with no additional formatting that would introduce difficulties in file conversions later on in the process. Data that's collected during fieldwork involves a whole bunch of different types of data sets collected in various different ways. We have archaeological surveys, we have excavations, we have sampling, all of which generate attributes about sites, artefacts from those sites, environmental remains, samples for dating, various environments. I'm not going to go into any one of those methods or data sets specifically. In digital terms, what all of that results in are a series of written records, objects, physical material, raster images and vector images. Obviously, with the data that's collected in the field, there is a lot more analogue data, if that's the correct term in these digital forums, that needs digitising: the paper records, of course, and the objects and physical remains themselves. That digitisation is done using the same protocols that we've already defined during other phases of data collection. On top of that data collection, then, we also have data being produced in four main ways.
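The idea of collecting against an agreed template and a predefined terminology can be sketched in a few lines of code. This is only an illustration, not anything from the project itself: the field names and the vocabulary are hypothetical, and a real template would cover far more fields.

```python
# Minimal sketch (hypothetical field names and vocabulary): checking a newly
# collected record against an agreed template and controlled terminology
# before it enters the shared data set.
REQUIRED_FIELDS = {"site_id", "artefact_type", "material", "recorder"}
ARTEFACT_TYPES = {"pottery", "coin", "brick", "bead"}  # predefined terminology

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if record.get("artefact_type") not in ARTEFACT_TYPES:
        problems.append(f"unknown artefact_type: {record.get('artefact_type')!r}")
    return problems

record = {"site_id": "VID-001", "artefact_type": "pottery",
          "material": "terracotta", "recorder": "AB"}
print(validate_record(record))  # []
```

Catching a stray field name or an off-vocabulary term at the moment of entry is far cheaper than untangling it during the later conversion and archiving stages the talk describes.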
First through documentation, second through recording, third through compilation and fourth through analysis. The documentation stage is crucial. While we are collecting data in the field, we are documenting absolutely everything that we do, so that documentation of what we're doing, why we're doing it and how we're doing it becomes another data set in itself. It's really important that that documentation stage is in place, and that that data set is produced, in an archaeological project specifically, for three main reasons. One is that archaeological practice is inherently destructive. You cannot ever repeat exactly what you are doing at that moment. You can never dig the same hole in the ground. You can never discover that same artefact again. You absolutely have to ensure that a record is taken of exactly what's being done, primarily to inform other people who may want to return to that site, so that they know how to interpret an absence of material in the space you've been looking at, because you've taken all of the artefacts from it. They need to accommodate that in their own interpretation of what they're looking at. The second reason is that so much of how we approach and interrogate that data is utterly dependent on where it was found, how it was recovered and why it was recovered. Third, archaeological sites are being destroyed and landscapes are changing at a really alarming rate, in South Asia especially, which means that a sort of parallel point to the exercise is to record exactly what you're seeing at that moment in time. That then becomes a snapshot archive to help future generations of archaeologists, who might come along in just two years' time and not even see that there's an archaeological site there. We're preserving things at the very point of destroying them, if that makes any sense.
For all of those reasons, there needs to be that detailed record of what's being done, which then runs parallel to all of the data sets that we are creating. Data is also produced through recording. That's very much pertinent to the artefacts that are found through archaeological fieldwork: the pottery, the coins, the monuments. They're all catalogued and recorded and photographed. What data, and thus how much data, is recorded very much depends on and is dictated by the material that's being recorded and what analyses are going to take place on it. Most of that data is recorded directly and stored in spreadsheets and databases; which one is used again depends very much on the material type that's being recorded and the analyses that will follow. At the same time, multiple photographs are often taken of the same object, which in some instances absolutely have to be taken at a very high resolution to record the level of detail that's needed. There are certain issues of scale there as well, though not nearly of the same magnitude as the Buddhist project. If we just think about pottery assemblages, any one site might have 150,000 sherds of pottery. Each one has about 40 different variables recorded and at least one or two photographs taken of it, so suddenly there are terabytes of photos and tens of millions of little bits of data pertaining to individual potsherds, all of which needs to be managed. We rely a lot on implementing the same protocols that we've already implemented at the primary data collection stage, being ruthlessly standardised about our file naming protocols and the file formats that we're using, really trying to ensure some level of standardisation and integrity to the data within those data sets.
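The point about ruthlessly standardised file naming can be made concrete with a small sketch. The naming scheme here is invented for illustration, not the project's actual convention: the idea is simply that an object identifier embedded in every photo filename lets tens of thousands of photographs be matched mechanically to their attribute rows.

```python
# Minimal sketch (hypothetical naming scheme): a unique object identifier
# such as "VID001-P-000042" is embedded in every photo filename, so photos
# can be joined back to the attribute data recorded in a spreadsheet.
import re

PHOTO_PATTERN = re.compile(r"^(?P<object_id>[A-Z]+\d+-[A-Z]-\d{6})_\d+\.tif$")

def object_id_from_filename(filename: str):
    """Extract the object identifier, or None if the name is non-conforming."""
    match = PHOTO_PATTERN.match(filename)
    return match.group("object_id") if match else None

# Attribute data as it might sit in a spreadsheet row, keyed by identifier.
attributes = {"VID001-P-000042": {"fabric": "red ware", "rim_diameter_cm": 14}}
oid = object_id_from_filename("VID001-P-000042_01.tif")
print(oid, attributes.get(oid))
```

A strict pattern like this also doubles as a check: any filename the pattern rejects is flagged for correction before it enters the data set.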
But we also now start needing to link various data sets together in more meaningful ways, so at this point of data production we're also making sure that we're using unique object identifiers within data sets, identifiers that are carried through into the file naming protocols for photographs so that those photographs can be linked with the attribute data that's stored in spreadsheets and so on. The third type of data production is data compilation. In many instances data sets are single entities; they can be looked at separately or in conjunction with one another as needs be. But we have another aspect to data in our project, which is the construction and use of a geographical information system, which in a very crude sense is a compilation of numerous digital data sets into a single geodatabase that then becomes usable as a resource that is larger than the sum of its parts. That GIS comprises all sorts of different data sets, from aerial and satellite imagery and digital elevation models to certain types of spatial data that pertain to various things in the landscape. Those need to be created: points, lines and polygons are generated based on pre-existing information and then also go towards the compilation of the GIS. In the construction and use of a GIS there are all sorts of data management concerns; for instance, there are different processes that are required to make all of the data relatable, converting it all into the same projections and coordinate systems. And it's also crucial for the use of a GIS that all of the spatial data, our points and our lines and our polygons, are associated with the attributes that explain what they are, so that necessitates not only having relatable spreadsheets that contain the values for that data but also assigning metadata to particular shapefiles within the GIS.
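The requirement that all layers share one projection before they can be combined can be expressed as a simple guard. This is a toy illustration, not GIS software: real tools carry out the reprojection itself, but the principle of refusing to mix coordinate reference systems silently is the same.

```python
# Minimal sketch (hypothetical layer structure): every layer declares its
# coordinate reference system, and combining layers with mismatched CRSs is
# treated as an error rather than silently producing misaligned geometry.
def combine_layers(*layers):
    """Merge feature lists only if all layers share one declared CRS."""
    crss = {layer["crs"] for layer in layers}
    if len(crss) != 1:
        raise ValueError(f"layers use different CRSs, reproject first: {sorted(crss)}")
    return [feature for layer in layers for feature in layer["features"]]

sites = {"crs": "EPSG:32643", "features": [{"type": "point", "id": "VID-001"}]}
rivers = {"crs": "EPSG:32643", "features": [{"type": "line", "id": "R-01"}]}
print(len(combine_layers(sites, rivers)))  # 2
```

Making the mismatch an explicit error is the design choice: a plot of misprojected points can look plausible while being hundreds of metres off, so failing loudly is safer than failing quietly.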
There are thankfully various criteria and standards and conventions in the management of that data that already exist, so we're not having to think anything up ourselves; we're very much relying on decades of codes of practice to do this. Then there's data analysis. Data is produced during the analysis of the data that's been collected: the results, that's data, all of which then feeds back into other points in the data management plan. So some results of analysis are research findings in themselves and feed directly into later data curation stages, some results feed back into informing further strategies for data collection, some feed back into informing further sets of analysis, and some of the data produced during analyses goes towards the construction of our GIS, which is then capable of performing further analyses in itself. In terms of the analyses themselves, various types of analysis take place at various points in time during each stage of project activity. I'm not going to go into the specifics of each type, but just to give you some idea, we might talk about the analysis of background site data, the analysis of the texts of inscriptions, a whole bunch of different analyses that are applied to different artefacts and environmental remains, sites themselves analysed at various scales, and GIS analysis. All of these different types of analysis follow set protocols specific to them; they each have certain processes, they each have certain methodologies, and all of that is again documented. We're adding all the time to this parallel data set that runs the entire way through, which is the documentation of each and every stage of the process.
In terms then of data curation, the fact that data curation comes after analysis is a bit artificial of course; it's not just after analysis that we think about curating data, it's just positioned here in this order for convenience. Curation we identify as having three main elements. One is the storage of our data, which of course pertains to both the material objects and the digital data. The second is the sharing of that data. Within the project we share data via email, which is quite simple, or using a cloud service, with all parties agreeing to and working within certain parameters: preserving and maintaining data integrity, keeping to the same file formats, continually checking for errors. Because we work in so many different locations geographically, there are certain logistical problems when it comes to sharing the data; not every single member of our project team has access to the internet, for instance. So from time to time digital data has to be transferred in hard copy; there is just no other way around that. That creates certain challenges, both logistically and in terms of workflow and the timetabling of research. The third element is preservation: it's importantly in this post-analysis stage that we have our main preservation intervention point, where we decide what data we are going to discard and what we're going to keep. It's largely older, redundant versions of files, which pretty much means incomplete data sets, that we feel comfortable we can discard after finalised, clean data sets have been analysed. Data is disseminated, I suppose, in the usual academic formats, that is through reports, publications and presentations. All of the different foci of research and lines of inquiry that lead up to publications and presentations on specific topics have data sets appended to them. And those data sets, as well as the main outputs themselves, are all uploaded to open access repositories, which leads neatly into the archiving of data.
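When data has to move offline rather than over a network, the continual error-checking mentioned above usually takes the form of checksums. The talk doesn't say how the project does this, so the following is only a generic sketch of the technique: a manifest of file hashes written before the transfer, against which the receiving end can verify every file.

```python
# Minimal sketch: a checksum manifest written before an offline transfer
# lets the receiving end confirm that no file was corrupted en route.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in chunks so even multi-gigabyte images hash safely."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(folder: Path, manifest: Path) -> None:
    """One 'checksum  filename' line per file, like the sha256sum tool."""
    lines = [f"{sha256_of(p)}  {p.name}"
             for p in sorted(folder.iterdir()) if p.is_file()]
    manifest.write_text("\n".join(lines))
```

The same manifests pay off again at the archiving stage, since repositories can record the checksums as fixity information alongside the deposited files.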
Zenodo, as has already been mentioned, is our open access repository of choice. Having implemented, and been so strict about, all of our protocols for maintaining clean data sets from the very point of data collection, turning them all into an archival format really doesn't take that much work. We are able simply to upload the data sets that lie behind publications and other outputs to Zenodo, and there we really take advantage of the fact that they are assigned unique DOIs to create further linkages between our outputs and our data sets, so that they will always relate to each other and are thus discoverable and usable by anybody who might want to use them. There are certain challenges within this project. As I just mentioned, working across countries, we've identified certain weak points, specifically in the transfer of data, and ensuring that similar standards of data accuracy are maintained across these different countries is something that we are aware of. None of those issues are wholly insurmountable. On a different level though, and this is my last slide, all of this works within this project: how all of these data sets link together, and how all of that's managed, very much makes sense within an archaeological project, in part because doing this is such well-established practice within archaeological projects. We're not having to reinvent the wheel here; this is how we run archaeological projects, this is how data is managed within them and has been for at least a couple of decades. The next challenge, as I see it, is how we can take this project, this data management plan, and the linkages made between disparate data sets within it, and establish links with the other sub-projects that are part of this wider project, to think beyond the boundaries that still separate them from each other. That's where we still need to do some thinking.