The ARF Wednesday lecture, thank you for coming. Sorry about last week. The speaker who was going to come couldn't make it; something came up. He was sick, a sinus infection, which sounds painful. So he is speaking in two weeks, I believe, and we're going to hear from him again soon. But today we're very, very lucky, and I'm thrilled to announce Dr. Sarah Kansa, who hopefully all of you have met at some point. She now works with ARF, with all of us here, and you can probably interact with her on various occasions. She's a very important participant now. But she also works for, she says Open Context, but I thought it was Alexandria; she'll explain that. OK. Alexandria started in '07. She got her PhD from Edinburgh working on faunal remains from Jordan and Turkey, and has continued on, now working in Italy. So she has worked on a series of projects and is very engaged in zooarchaeology, but also involved in Open Context and in open research and open data access, making archaeological material available for scholars to work with. So many things, and they come together in her title, "Overcoming Specialist Silos," which is also very important for us as we want to try to blend and bring data together to ask more interpretive and more nuanced questions. So I'm going to turn it over to Sarah, and I hope that you all enjoy her presentation. Please give her a warm welcome. Thank you. Thank you, Christine. Thanks for having me. So I'll just explain what I am: I'm now half time at ARF, and I also run a nonprofit organization half time, called the Alexandria Archive Institute, which was founded in 2001. Over the years we ended up developing an open access data publishing system called Open Context, and we recently melded the names together, so now Open Context is also the name of the nonprofit organization.
Open Context was just what we were becoming known for, more than this really wordy Alexandria Archive Institute, which isn't in Alexandria and was causing a lot of confusion. So anyway, I'm now promoting the Open Context side of things. So this is Open Context. The system itself was launched in 2006, and it's run by an independent nonprofit organization based here, actually, in Berkeley. We originally got funding from the Hewlett Foundation to demonstrate how one might bring together data on the web. This was back in 2003, when Google was just starting to become a thing, so it was really early on, and Hewlett was funding open educational resources. By luck we ended up going to them, and they said, this is something we would like to see done; we'd like to see what you can do in terms of bringing archaeological data together. Eric Kansa, who's my husband, is the co-founder of this effort, and we've worked together now for 15 years on this. We both got PhDs in anthropology and archaeology and had spent a lot of time, as you all have and will be doing, collecting lots and lots of data to analyze for our dissertation research. And we were both really frustrated that at the end, most of that never saw the light of day. It got summarized or whatever, but we were never able to share it. Being fresh out of grad school, we thought, maybe we should try to do something about this. That's where the idea started for this Open Context thing, which was launched a few years later with help from Hewlett. Fortunately, we're now seeing increasing attention paid to data archiving, so these efforts are getting supported more by different granting agencies, and we've expanded our support. We're mostly grant funded, but increasingly we're getting funding from data management plans that people are writing into their grants, which is great.
I think probably 90% of what we publish in Open Context now comes from funded grant projects. Open Context now has over 100 data publications, and I'm going to go through some examples of what data publications look like and our approach to publishing data. It includes over 1.5 million items, and I'll explain why there are so many; it has to do with the approach to sharing data that we take. This includes a whole lot of images and media. Part of that is because when we first started, it was the days of Flickr, and we first thought, wouldn't it be great if it could all be crowdsourced, and people could tag things, and there could be a lot of images and everyone could share their stuff? Then we realized over time that people were actually more interested in having something that was more like a formal publication, not this loose data sharing, but something that you can actually share and maybe get counted towards your tenure and promotion and that kind of thing. So I'm going to start this talk by giving you some examples of the data publications we have in Open Context. A lot of this is informed by my work as a zooarchaeologist, so if you go to Open Context, you'll see that there's a whole lot of zooarch data in it. Part of that is because zooarchaeology is a very data-heavy field: we collect lots of data that has the potential to be integrated, so it was good low-hanging fruit to start with. But it's also because I had a lot of connections in that area from having worked in zooarchaeology. So I'm going to talk about what we've learned about data publishing from zooarchaeology, and then I'm also going to move on to some zooarchaeological studies that have informed our understanding of data reuse after the data is published.
So just really briefly about Open Context again: it's been developed iteratively over 10 years, so we didn't build something static; it's something we keep trying to improve as we get feedback from people over the years. We really focus on linking out, so we try to link to other systems that are sharing data as well, because we realize we can't do everything. We're working on doing our part well and then linking across to other systems that are approaching data sharing in a different way, so that together we can start sharing data across the web. It's built on all open code and all the data is open, so we are careful about making sure that people actually have the right to share the data they're sharing. We can't do everything, so we are picky about what we publish, and we clear it with everyone that it's okay that it's open access. Back in, I think, 2011 and 2012, NSF Archaeology and the NEH Office of Digital Humanities both listed Open Context as one of the suggested places people might go to fulfill their grant data management requirement. We archive with the California Digital Library here and then elsewhere across the world; we have mirroring at the German Archaeological Institute and other places in order to keep the data safe. And over the past few years we've received a few awards in recognition of this work, so this stuff is really starting to be recognized as important in our field. We got a digital curation award for a paper that we gave about some of this work, and the AIA gave us its Award for Outstanding Work in Digital Archaeology in 2016. And in 2013 the Obama White House, the good White House, recognized Eric Kansa as a Champion of Change for his work in open science and had him go out there and talk, which was pretty awesome. He didn't get to meet President Obama, unfortunately.
So obviously Open Context is involved with data, and data is a hot topic; it's the subject of lots of investment in the worlds of government, business and big science. The Economist even recently said it's the most important resource in the world now. But today I'm going to talk about how data fits rather awkwardly into our day-to-day practices in archaeology. I'll give a brief overview of why data is often hard for us to deal with, and how with Open Context we're exploring ways to make data less awkward and more effective for archaeology. Then I'm also going to discuss my own work in zooarchaeology and coping with how to effectively share specialist data. So I'm going to discuss two main sides to this: the challenges associated with motivating good practices among data creators, and then the related challenges in encouraging data to be reused. We're all familiar with the phrase "publish or perish." Academic researchers are obviously under a lot of pressure to publish in the right venues with the right kind of impact, and the kinds of publications that count for getting a job, a promotion, or tenure are all very limited. Of course this leads to some perverse incentives in archaeology: we're rewarded for certain kinds of publications, but these publications typically present only a very tiny portion of the research documentation that we create. Since archaeological research, and especially excavation, can be destructive, we need to find ways to motivate our peers to be more comprehensive in sharing their research results. So motivating data sharing is an important issue for cultural heritage and stewardship. This isn't just a problem in archaeology; lots of other sciences have trouble motivating more comprehensive data sharing. In response, the Obama administration began requiring grant seekers to provide data management plans, as you all know. This was a good first step, but it's not enough.
Data management plans often don't get much attention, perhaps increasingly so. And the publish-or-perish pressures continue, so digital repositories tend to bend over backwards to make things easier for data creators, even if that creates long-term difficulty for data reusers. The goal of most digital repositories is to make it as easy and painless as possible for researchers to deposit content; that way they can keep costs down and not frighten away researchers who have other publication pressures. But the problem is that this is a garbage-in, garbage-out kind of issue: if digital repositories only try to make things easier for depositors, reusers may suffer. We worked with a study called DIPIR, led by our colleague Ixchel Faniel, who is with OCLC Research, a library research group, and Elizabeth Yakel, who's a dean at the University of Michigan School of Information. This study explored some of these issues in reusing data stored in digital repositories, and I'm going to talk about that research a little later. So, sorry for the vegetarians, but not surprisingly, metadata and data quality issues represent big challenges. A messy and poorly documented data set may require lots of work to fix, understand and clean up for reuse. So to meet some of the incentive challenges, to not only share data but also to share understandable and clean data, we're exploring a model of data sharing as publication. We hope to encourage similar effort and rewards in the dissemination of data as in more conventional publication: we want researchers to invest effort in making good data, and we want researchers to see recognition and rewards for that effort. So this shows our elaborated Open Context data publishing workflow, where we have added these stages of documentation, review, editing and annotation.
So we work with people's data sets to make sure that they are cleaned up, that they make sense, and that they're annotated with things that can link them out to other data on the web, to make them more useful for reusers. I just wanted to talk a little bit about how Open Context differs from most digital repositories. It extracts data from contributed data sets and uses those extracted data to build a single common big database. In contrast, most repositories store uploaded databases as individual files described by some metadata. There are advantages and disadvantages to these two approaches. The Open Context approach is much more expensive and time consuming, because we have to put lots of effort into cleaning data, extracting data and fixing problems in the data so it can be published online. In contrast, most repositories are more focused on the metadata than on the content of the files, so in most repositories, in order to see what's inside a data set, you have to download it and open it up with the right software. However, because Open Context extracts data into a common big database, we do overall reduce some problems with complexity: each contributed data set can be searched and queried through a common set of software tools and services, so you don't have to build a new system for each database. I should also note that Open Context only has a common system of organizing data; it retains the data authors' own systems of description, so that each data set can have its own typology and descriptive information, et cetera. This is initially more expensive, because it puts a lot of work on the ingest side, but in the end it is cheaper, because reusers aren't confused and unable to use a data set, and they don't have to do all that cleanup on the other end.
So Open Context puts extracted data into a common database because so much relevant archaeological information is often scattered across many different data files in a given project. Here's a coin from Domuztepe, which is a mostly Neolithic site in Turkey, but which had a late Roman coin hoard on part of the site. The data about this coin came from four different data files: a relational database that stored the context information; a spreadsheet that stored the photo log; another relational database that stored the thousands of small finds records this was part of; and another spreadsheet created by a numismatist with more specialized descriptions of the coin. So basically the complete record of this coin was scattered across many data files, and the Open Context publishing workflow brought all of this scattered information together into a more cohesive whole, which simplifies access and analysis. In a traditional archive, you'd have to go and download all four of those different files, find the coin across all four, and piece it together that way. So it simplifies the research process. If you want to read more about these comparisons, there's a somewhat dated (2015) but really clearly written article by Beth Sheehan that compares, in this case, tDAR and Open Context, but more generally a traditional archive with the Open Context approach. So, with this data-sharing-as-publishing setup, we have tried to make it like a publisher: we have editors, and then we have an editorial board of experts who can review the different data sets for consistency and quality and that kind of thing. So we do go through a peer review process that makes sure that the data sets are sound. And just a word on intellectual property.
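The record-joining just described can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the identifier, field names, and file contents are invented to mirror the four sources mentioned, not taken from the actual Domuztepe data.

```python
# Sketch: assembling one find's record from several per-project data
# sources, joined on a shared find identifier. All identifiers and
# field names are illustrative only.
context_db = {"find-042": {"locus": "Area A", "period": "Late Roman"}}
photo_log = {"find-042": {"photos": ["coin_042.jpg"]}}
small_finds = {"find-042": {"category": "coin", "material": "bronze"}}
numismatics = {"find-042": {"denomination": "nummus"}}

def assemble(find_id, *sources):
    """Merge every source's fields for one find into a single record."""
    record = {"id": find_id}
    for source in sources:
        record.update(source.get(find_id, {}))
    return record

coin = assemble("find-042", context_db, photo_log, small_finds, numismatics)
```

The point is simply that once records share an identifier, the reuser gets one cohesive record instead of four files to reconcile by hand.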
You typically use data in somewhat different ways than you use text that you read, and one important use of data is to combine different data sets from different sources. But in order to do that, you need to have legal permissions. Furthermore, open licensing simplifies preservation of data: with copyright permissions granted by open licenses, we can and do archive data with multiple repositories in the US and overseas. So all data publications in Open Context have open copyright licenses, usually the Creative Commons Attribution license, which essentially gives explicit legal permission to use the data and media as long as you cite the creator. And we fully recognize that not all archaeological content should be published this way. There can be ethical problems with open licenses, especially in contexts with histories of colonialism, so we explicitly ask that researchers engage in community archaeology so they can understand how to ethically participate in open access and open licensing. So now I'm going to move on to a few examples of publication types in Open Context. Other than adhering closely to our data publishing and intellectual property guidelines, we take a broad approach to what kinds of data we publish. We realize you can't just build something and expect people to follow your model, so we don't require full excavation data sets; we realize that people have different motivations in sharing their data, and we've really tried to work with the data creators to meet their needs. For that reason, we have a wide diversity of data publications in Open Context. I'm going to show you the different types we can cluster them into, and these types are increasing, actually, as I'll mention at the end of the talk. So the standalone publication is where an individual or team wants to put their entire project on the web as a standalone data publication.
Basically, everything they collect in the field ends up in Open Context. One example of this is the site where I work in Italy, called Poggio Civitate, in Murlo. As you can see here, they've got all these different types of finds that they're putting into Open Context over time and linking together. This is from 52 years of excavation, so it includes old field notes and things that were just on paper, which they've scanned over the years, entered digitally into their in-field database, and then put into Open Context. This is an example of a record of an ivory object from Poggio Civitate, and it's linked to images, mapping data, context data, authorship, et cetera. So Open Context provides that common infrastructure for publishing and linking all of these different sorts of content within the project. On the bigger-data side of things, we also have several projects that are multi-year efforts at integrating data across broad spatial or temporal ranges. An example of this is the DINAA project, the Digital Index of North American Archaeology. This is an ongoing, really big project funded by the NSF, with colleagues Dave Anderson at the University of Tennessee, Knoxville, and Josh Wells at Indiana University South Bend. We're basically working with the SHPOs, the State Historic Preservation Offices, to get the site file data from every state put into this common infrastructure, so that you can search across state boundaries and do bigger-picture research across the US, rather than having to go state to state to state to find that information. The site locations are all approximated on a 16 kilometer by 16 kilometer grid, so no precise site locations are released. Sites are linked through their Smithsonian trinomial number for the site ID.
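One way the 16-kilometer approximation might work is to snap each site's projected coordinates to the center of its grid cell before release. This is a sketch of the general idea, not DINAA's actual implementation; the coordinate values are made up.

```python
# Snap a site's projected coordinate (in meters) to the center of its
# 16 km x 16 km grid cell, so the precise location is never released.
def snap_to_grid(x_m, y_m, cell_m=16_000):
    cx = (x_m // cell_m) * cell_m + cell_m / 2
    cy = (y_m // cell_m) * cell_m + cell_m / 2
    return cx, cy

# Any point within the same cell maps to the same published location.
approx = snap_to_grid(483_712.0, 4_192_330.0)
```

Because every point in a cell collapses to the same coordinate, a published dot tells you a site exists somewhere in a 256 square kilometer area, which supports regional analysis without exposing the site itself.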
And so the idea is that over time, as this is published in Open Context, it acts as a sort of gazetteer where each site is like a peg, and resources from across the web that refer to those sites can all be found based on being linked to that Smithsonian trinomial number. So it's hoping to improve access to information about all these sites from across the web over time. This is a very long-term project. The dots in California are because we have a collaboration with the Hearst Museum, where Hearst objects that have Smithsonian trinomials are plotted onto the map. If you need to, you can find out that an object is there, but then you have to go to the Hearst to see it, for now. So it's an example of how we might link museum objects with this system. And this DINAA data has actually already been used in some interesting research: there was an exploration of sea level rise impacts on coastal archaeological sites, and it got a lot of press, because I think very few people realize just how many archaeological sites there are in the U.S. Tens of thousands of sites are threatened by rising sea levels over time. Then, increasingly, we have people who are coming to us to help fulfill their grant data management plans. One example is a PhD student at Harvard, Max Price, whose research included cementum analysis of pig teeth, and he shared the data related to his PhD research. What's nice is you get your citation, which has a DOI for the project, but then every single item in the project has its own citation too. Every single pig tooth, every single fragment, has its own webpage with a unique citation, so you can point specifically back to that item if you want to talk about it in another publication. You don't have to say, go to this archive, open this file, look for bone number whatever. It's a much less ambiguous way of doing research.
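The per-item citation idea amounts to minting one stable web identifier per record, independent of whatever file the record originally lived in. A minimal sketch, assuming a made-up base URL (this is not Open Context's real URI scheme):

```python
import uuid

# Mint a stable, globally unique URI for a single published item.
# The base URL is an invented example for illustration only.
def mint_item_uri(base="https://data.example.org/items"):
    return f"{base}/{uuid.uuid4()}"

# Each specimen gets its own citable address, so a paper can point
# at one bone rather than at a whole downloadable archive.
item_uri = mint_item_uri()
```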
What's also nice is that each project in Open Context can have its own customized banner and look, so it kind of looks like your own webpage. Max actually had a colleague, I think in Turkey, who did these watercolors, and he added those as his banner, which I thought was kind of nice. It adds a little bit of your own personal creativity to the site, rather than it being quite so clinical. This is Lizzie Wright, who is based in the UK; she made a biometrical database of aurochs and domestic cattle measurements for her PhD. And what's fun is that you can search all of hers, but you can also get all the cattle measurements and wild cattle measurements from wherever else in Open Context there's data, and ideally from elsewhere on the web too, eventually. We also support archival research. This is showing a bunch of drawings, excavation and conservation data about the Sphinx. The monument is changing all the time; it has challenges with tourism and pollution, so this is a good way to help understand how the monument is changing over time. This is part of ongoing publishing work that we're doing with Mark Lehner's group on publishing huge amounts of excavation data from the Giza Plateau. And the next data set is actually a paleobotany data set by Claire Malleson, from the excavation of one of the more domestic contexts at Giza. Then we also have people who want to publish a data set that is related to one of their publications: basically a smaller data set that they link to from a publication. This is an example of a paper in Antiquity that included a link to the data set published in Open Context, so readers can go back, look at the primary data set, and reproduce the claims made in the paper, rather than just getting a summary data table. This is a larger volume that does a similar thing, but it's more like an edited volume.
So basically they have a print volume about the archaeology of Mesoamerican animals, and several of the chapters had data that the authors wanted to share. So in Open Context we set it up like an edited volume too: there's a main page, and then there are subpages for all the chapters, with the data sets that relate to the chapters in the print volume. And then finally we have data sets supplementing monographs, for when there's just too much data to put in print. This happened for the Petra Great Temple excavations, which had a ton of images and data that couldn't make it into the printed monograph. What we did was publish all the data in Open Context, and then the data creator, Martha Joukowsky, asked for links to certain things that she wanted to refer to in the publication. So we gave her a huge list of links, and she has now linked up her print publication to individual items, like these nice drawings and other things that couldn't all make it into the volume. Another category is emerging, actually, where people want to make community-created databases: they set up a page and then ask people to contribute certain things, so that they can build out their data with their community. For example, we have an osteometric database of South American camelids. We have a new one now on Chinese petrography, which actually has instructions in English and Chinese; people can download a spreadsheet that has both, enter their information, and submit it back in. This is something where we're really interested to see how it pans out. It takes a lot longer, because you're asking lots and lots of people to share their data, and it takes a long time for people to get their data organized. So for now these pages just sit and get promoted in this way, sharing things like downloadable spreadsheets for people to do data entry.
Other ways that we've tried to address needs out there include offering to do development when people need something special. For instance, one project had images that they wanted to be zoomable, because they are really high resolution with details that you can't see in a single image. They came up with extra funding so that Eric could build in a zoomable feature, and now that's there, and all other projects can use it, because it's part of Open Context now. We also responded to a need for 3D models, and so another project is doing a test to see how we might link a print publication with 3D models online. They also came up with funding to build in an open source renderer for 3D, which allows you to change the lighting and take measurements and everything right there in Open Context; you don't have to download anything special to be able to look at it. And again, that's nice, because other projects can use it now that it's built. So we're really trying to respond to the needs out there, and it makes Open Context very diverse but also pretty dynamic. All right, so now I'm going to move on to the zooarch part. So far I've focused on the data creators and how we're trying to use a model of data sharing as publishing to encourage data creators to share more and better data. But for publishing to have an impact, the data needs to see actual reuse by a wider community. So now I'm going to look at the challenges we face in reusing archaeological data. I'd like to focus on specialist data and describe issues in our current research practices and how to improve them, so that data sharing and reuse can better contribute to our efforts to understand the archaeological past. I'm a zooarchaeologist and have lots of expertise in this area; I've worked not only in the Near East but also here, at a project on Yerba Buena Island, on the east coast of the US, and in Europe.
Comparatively speaking, zooarchaeologists are often engaged in multi-site comparative research, and so we have a greater interest in sharing, reusing and integrating multiple data sets. To understand the intersection of data creation and reuse needs, we have undertaken several research projects over the past five years or so. What I'm going to talk about ranges from my own reflections on personal experiences as a specialist, as a data publisher, and as a data reuser; then I'll move on to a much more formalized study called the Secret Life of Data project, which involves workplace ethnography and other qualitative research methods. As I mentioned, zooarchaeology turns out to be the low-hanging fruit for collecting information about this topic. Zooarchaeologists in particular, and specialists in general, create very useful data that benefits the entire project, but their data often gets siloed and is never fully integrated into the interpretation of the project. This happens for many reasons: often because specialists are working on very different timelines than the actual excavation, and because we often use recording methodologies and strategies that differ from the project's or from other specialists'. So we often create really useful data, but it just never makes its way back to the project. I'm currently working at the site of Poggio Civitate, in Murlo, Italy, just south of Siena. It's an amazing area, really a fun place to work. It's been excavated for more than 50 years under three different directors, the first of whom was an art historian, so as you can imagine, very different excavation strategies were employed over the decades. And surprisingly for an Etruscan site, they have tens of thousands, maybe even 100,000, bones recovered over this time. It's an amazingly huge assemblage for this time period.
So when I came into the project eight years ago, they just had boxes and boxes and bags and bags of bones from all those years, and some were, you know, that big: I mean, 400 bones in one bag from one context from 1972. And now they're excavating and bringing back these tiny little bags per context, you know, three bones and things like that. So obviously there's a comparison issue here between the different contexts, and that's something I was trying to find ways of dealing with. Surprisingly, there are actually some really great research results coming out of this project. I work there every summer, and I can only analyze in the field, which actually helps, because I can communicate with the project team about what's happening. And because I'm analyzing the old material and the new material, some things are coming to light from what they found before that are helping inform their excavation practices now. One of the issues is that, among the old bones, we found a whole bunch of newborn baby bones, and they had no idea that this was something that was going to pop up. So now they've changed their excavation strategies to be a little more careful about excavating bones, to try to get a better understanding of where these baby bones occur, because in the past, they were just taken out with all the rest of the bones. So the site is Etruscan, mostly from the Orientalizing period, about the seventh century BC, and it wasn't occupied for very long, maybe 100 to 150 years. We have remains from basically three different areas. The residence had a lot of remains indicating that it was some kind of elite area. The tripartite building, we don't know what it was; maybe a ritual area? I don't know. It's not well understood, and there's not a lot of material from there.
The workshop is this huge workshop with tons and tons of industrial waste and remains, so they were doing some kind of industrial activity there. And what I found, amazingly, is that even after 50-plus years of different excavation strategies, there are some real differences across these areas. One is in the occurrence of wild animals. We have some pretty cool wild animals coming up, like wolf. There was even an aurochs, a wild cow. There's lots and lots of wild boar and all sorts of other animals, in addition to the regular pigs, sheep and cattle that you get. What we're seeing is that there are differences in the occurrence of these wild animals at the site: the more scary, formidable animals occur in the residence, and the smaller animals occur in the workshop, perhaps for pelts, I don't know. In the residence, the wolf was actually represented by left and right mandibles found together in one context, so they were obviously articulated and maybe were part of a skin. So there's some neat stuff going on there. I think even more remarkable is that early on I started noticing that there seemed to be a side preference in certain portions, especially among pigs, where there was a predominance of right-sided forelimb bones, especially the ulna, in the residence, the more elite area, and a left-sided predominance in the workshop. This is something I thought was pretty exciting, because people digging across those five decades, without zooarchaeological expertise, could never have created this pattern. If you were just picking out large bones and that kind of thing, that would be a pattern humans could create, but nobody would even know what a left or right ulna looked like unless they had training. So this is a real pattern happening across the site.
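A pattern like this can be checked with a simple tally over specimen records. The records below are invented for illustration; the real analysis runs over the published Poggio Civitate data, and the record shape (area, element, side) is an assumption.

```python
from collections import Counter

# Toy specimen records: (area, element, side). Invented values.
specimens = [
    ("residence", "ulna", "right"),
    ("residence", "ulna", "right"),
    ("residence", "femur", "left"),
    ("workshop", "ulna", "left"),
    ("workshop", "ulna", "left"),
    ("workshop", "ulna", "right"),
]

# Count ulnae by (area, side); a skewed tally suggests carcass halves
# were being routed to different parts of the site.
tally = Counter((area, side) for area, element, side in specimens
                if element == "ulna")
```

On a real assemblage you would follow a tally like this with a significance test, but the counting step itself is this simple once the specimen data is in one queryable place.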
Excitingly, we can now map it, because there's enough data in Open Context. So this is the residence and this is the workshop, and we're looking at the right ulna. You can see that there are right ulnas in both areas, but when you look at the left, almost all the lefts are occurring in the workshop area. This is about 178 data records, so it's a pretty big sample. We also see some differences in rights and lefts for the wild animals in these two areas, same thing: more lefts in the workshop and more rights in the residence. So were they doing carcass distribution? Was there a market? Clearly there was some selection of the right side going into the residence area. For what reason, I can't say, but this wasn't just a random distribution of carcasses across the site. Carcasses were halved, and one half was going one place and the other half the other. There was also prior analysis on this assemblage by Michael MacKinnon, a Canadian archaeologist who's worked all over Italy and elsewhere. He had come to Murlo early on, a few years before I did, and had looked at maybe 1,000 bones really quickly, just to give the project an idea of what they had. I thought it would be really nice to incorporate his analysis, because two analysts, I think, are better than one: we can provide some checks and balances on each other. So I contacted him, and he was able to provide his data to me. It was in small written tables, which he digitized and sent. Here's just an example of our differences, from one context; I got the same bones and identified them as well. Comparing our identifications, he was over-identifying cattle where I was putting more into large mammal.
We both had dog, we both had tortoise, we both had bird. He had more pigs than I did, and he put sheep and goat together into sheep/goat, where I actually split out my sheep. So these are not huge differences. What it showed is that he had an over-representation of cattle and pigs and an under-representation of deer, and it's because Mike wasn't at the site for very long, whereas I've been there eight years. I now understand that there are these really large red deer at the site, but it took time with the assemblage to see this. I clustered those bones into large mammal, whereas he put them in cattle. It's just that once you have more time and expertise with a collection, you start to see these kinds of differences. We did a joint publication on it, and it was really good for us to know these differences. In particular, he was counting loose teeth, which is why he had more pigs than I did. He counted every single loose tooth as a separate specimen, whereas I put more time into trying to put jaws back together, and there were lots of teeth that clearly belonged to the same jaw, from one animal. Again, it's just the expedience of his visit: he was there a short time and wanted to give the project a quick overview. But this is good to know when we're putting data together. More recently, we formalized this kind of comparison with Hannah Lau, who worked on the same assemblage I did in Turkey. For 15 years I was involved in an excavation in southeastern Turkey called Domuztepe, a late Neolithic Halaf site. I analyzed probably more than 10,000 bones from the site, and then she came in; there were a lot more left. She wanted to do her PhD, so she analyzed another 10,000 bones.
And I wanted her to be able to use my data so she could work with a larger assemblage, but we had this issue of inter-analyst variation again: how do you actually combine data analyzed by two different people? So she undertook a study. She found some bones from the site that neither of us had analyzed, brought me down to LA, and we both analyzed the same assemblage and then compared our results. In the end, we had about a 6% error rate, cases where we identified things totally differently from each other, and then about a 20% rate of differences in the precision of our identifications. I would be more precise and say, this is a sheep, where she would say it's a sheep or goat, or a medium-sized mammal, something like that. Because she had less experience with zooarchaeology in general, she was being a little more careful about that, which is fine. This study highlighted the need to be explicit and transparent in our analysis. We ended up recommending that projects make time for small inter-analyst studies like this as a way to identify inter-analyst variation in their data. Researchers routinely calibrate their scientific instruments to gauge accuracy in measurement, but we largely don't do this in zooarchaeology; we typically don't calibrate our identifications of specimens. We also discussed issues of data documentation, and we encouraged writing a paradata document to accompany a data set, describing in detail the context, the analysis, the methods used, and any other information that will help a future user understand that data set better. This adds more time to our work, but it's critically important for reuse. We kind of see it as: what if the whole excavation team was abducted by aliens? What would be left?
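As an aside, the kind of inter-analyst tally described above can be sketched in a few lines. Everything here is hypothetical: the specimen IDs, the identifications, and the toy roll-up hierarchy standing in for a real taxonomic one. The point is separating outright disagreements from mere differences in precision:

```python
# Toy hierarchy: each precise term rolls up to a broader category.
BROADER = {"sheep": "sheep/goat", "goat": "sheep/goat",
           "sheep/goat": "medium mammal"}

def rolls_up(precise, broad):
    """True if `broad` is an ancestor of `precise` in the hierarchy."""
    term = precise
    while term in BROADER:
        term = BROADER[term]
        if term == broad:
            return True
    return False

# Two analysts' identifications of the same (invented) specimens.
analyst_a = {"DT-001": "sheep", "DT-002": "pig", "DT-003": "sheep/goat"}
analyst_b = {"DT-001": "sheep/goat", "DT-002": "cattle", "DT-003": "medium mammal"}

disagree = precision = 0
for spec, ident_a in analyst_a.items():
    ident_b = analyst_b[spec]
    if ident_a == ident_b:
        continue
    if rolls_up(ident_a, ident_b) or rolls_up(ident_b, ident_a):
        precision += 1   # same animal, different level of precision
    else:
        disagree += 1    # genuinely different identifications

print(f"{disagree} disagreements, {precision} precision differences "
      f"out of {len(analyst_a)} specimens")
```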
And how would we understand what they'd collected, the data they collected, and how they were doing their analysis, because they're gone? I mean, some day we're all going to be gone and our data will be left. How will other people use it? Hopefully not a bunch of aliens. I'm going to skip through this next study because I'm running out of time; it said the same type of thing. Just to highlight it: with some other colleagues, Levent Atici and Justin Lev-Tov, we produced a paradata document for an old, orphaned data set that had never come out. We produced a three-page document (you can't read this slide at all, sorry) that explained, for instance, that we took out certain bones for this and that reason, and combined these two categories for this and that reason. We basically justified our lumping and splitting decisions, and it took lots and lots of documentation. This is something you could never share in a regular publication; no one would ever know the choices you made, and so it makes your data not reusable. Furthermore, when we communicate our data, we usually do it in summary tables, and no one can reuse those at all. There's no way to go back and extract any kind of data from a summary table; it's just there to present the study being done. If you could point back to the primary data set and describe how you got to those numbers, it would make for much more reproducible and transparent research. That study was published in the Journal of Archaeological Method and Theory, which you can go see. And then we actually analyzed the old orphan data set and came up with some research results. We also undertook a study, which I think we reported on in a talk here at ARF a few years ago, a large-scale data sharing and integration project that explored the origins of farming.
And this was funded by the Encyclopedia of Life and the National Endowment for the Humanities. We brought together a bunch of zooarchaeologists working in Turkey to try to integrate their data sets, so that we could observe where the challenges of data integration were. Essentially, some fields were really easy to align and some were really hard, and the difficult ones were quite surprising. Things like measurements, tooth wear, and fusion data are things that, as a zooarchaeologist, you would think would be pretty standardized, that people would collect similar information. For instance, everyone used Angela von den Driesch's guide to measuring animal bones; that was the standard everyone had adopted over time for reporting measurements. But it was the way they modeled their data that became really difficult. Everyone used Excel; nobody used a database. So we all had our spreadsheets modeled differently. Someone would have the element listed and then all of von den Driesch's measurement possibilities as field headers, plugging in the measurements that applied to that bone. Other people would have a field that said "measurement 1," list the name of the measurement there, and then put the value next to it. Those kinds of things are incredibly hard to link up across spreadsheets without remodeling the entire spreadsheet. So in the end, even though this is maybe the most standardized that zooarchaeological recording gets, that was something that hung us up in terms of the comparability of the assemblages. Tooth wear had a similar problem: everyone followed the recording strategy proposed by Sebastian Payne for sheep and goat teeth.
It's widely used, but again, the vast differences in how different people modeled and recorded that data made it impossible to mesh. The same thing happened with fusion data: people used standard terms but then added on little notes and qualifications that made it really difficult to combine without glomming everything into really broad categories. Basically, this under-the-hood exposure led us to better documentation practices. We saw each other's data and went, oh my gosh, our data can never be compared; maybe I should start doing something more like yours. People had simply never looked at each other's data sets or seen how others had modeled their data. So this was really informative, and I hope it will move us toward improving our collection practices. Finally, to conclude: our current project is called the Secret Life of Data project, funded by the National Endowment for the Humanities. It's a three-year project that goes out to four different excavation sites across the world and does ethnographic observation and interviews to collect information on how issues in the creation of data affect the reuse of data downstream. It's a combination of interviews, field observations, interviews with data re-users who aren't part of the projects, and analysis of the databases these excavations are creating. What we're trying to do is improve this situation where data creation hardly overlaps with data reuse at all, where there's only a very small area of commonality. Our observations and interviews on these projects indicated that there were no clear guidelines for how specialists should integrate their data sets with the project database.
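Going back to the measurement spreadsheets for a moment: one concrete way to reconcile the two styles described earlier, measurements as column headers versus numbered measurement/value pairs, is to normalize both into long (specimen, measurement, value) records before combining them. A minimal sketch, with invented rows; the von den Driesch-style abbreviations (GL, Bd, SD) appear only as labels:

```python
# Style 1: every measurement is its own column header.
wide_rows = [
    {"specimen": "A-101", "element": "humerus",
     "GL": 142.0, "Bd": 31.5, "SD": None},
]

# Style 2: numbered "measurement n" / "value n" column pairs.
pair_rows = [
    {"specimen": "B-207", "element": "humerus",
     "measurement 1": "GL", "value 1": 150.2,
     "measurement 2": "Bd", "value 2": 33.0},
]

long_rows = []

# Melt style 1: every non-key column with a value becomes one record.
for row in wide_rows:
    for key, val in row.items():
        if key not in ("specimen", "element") and val is not None:
            long_rows.append((row["specimen"], key, val))

# Melt style 2: walk the numbered pairs until one is missing.
for row in pair_rows:
    n = 1
    while f"measurement {n}" in row:
        long_rows.append((row["specimen"],
                          row[f"measurement {n}"],
                          row[f"value {n}"]))
        n += 1

for rec in long_rows:
    print(rec)
```

Once both styles are in this long form, the assemblages can be sorted, filtered, and compared together without remodeling either original spreadsheet.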
Specialist analysis often occurs over several years, as I said, and in off-site locations, and those specialist studies and data sets are commonly siloed bodies of inconsistently managed data and documentation. The siloing often divorces zooarchaeological data, for instance, from its excavation context and impedes interpretation. Greater professionalism in project data management might include stating the project's expectations for sharing data within and outside the project, a timeline for specialists to complete their analysis, and plans for any data publications or conventional publications based on the project's data. From this study we're also recommending that projects discuss how to integrate specialist data with the primary excavation data set. That includes conversations about how to give faunal specimens unique numbers that make sense within the broader project, the file formats to be used, detailed documentation of paradata such as sampling strategies and identification methods, and data archiving and preservation plans. So in general we need to improve communication between specialists and other project participants, so that specialist data contributions can more meaningfully contribute to a broader understanding. This work is continuing: we're finishing our field observations this fall, and then finishing up the re-user interviews. We're also continuing to explore the importance of consistency in identifier management. This comes partly from our work with Open Context, where we get a lot of data sets with no unique identifiers. Someone will give us a spreadsheet with a whole bunch of data in it, but when you try to work with it and sort it, you realize there are overlapping specimens and no identifiers that mark each individual item, and this can lead to a lot of confusion in future data reuse.
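The identifier problem just described is easy to screen for before a data set goes into an archive. A minimal sketch, with invented IDs, that flags missing and duplicate specimen numbers:

```python
from collections import Counter

# Invented spreadsheet rows; two problems are planted deliberately.
rows = [
    {"id": "PC-1972-001", "taxon": "pig"},
    {"id": "PC-1972-002", "taxon": "sheep"},
    {"id": "PC-1972-002", "taxon": "cattle"},   # duplicate ID
    {"id": "",            "taxon": "red deer"}, # missing ID
]

ids = [r["id"] for r in rows if r["id"]]
missing = sum(1 for r in rows if not r["id"])
dupes = {i: n for i, n in Counter(ids).items() if n > 1}

print(f"{missing} row(s) missing an ID; duplicates: {dupes}")
```

Running a check like this at submission time catches overlapping specimens before they cause the downstream confusion described above.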
But our research is also showing that data management is not just a technical concern; it involves the humans who are part of every aspect of archaeological data documentation and reuse. So it's important to keep talking to people and not just focus on what technology solution is going to work. Oh, and then finally: we have a cat, and we go away for a month every summer, and a lot of people like to conclude with pictures of sunsets or animals. So this is Spooky, and he needs someone to live with him for a month every summer. We live in North Berkeley, so if any of you are interested in house sitting around June and July, come talk to me. He's lovely, very fluffy. Thank you.

There are just so many things here that I would love to ask about; we're going to go for coffee. But for the first question, the one you brought up really early, about incentivizing all this: people like me trying to get tenure are trying to figure out how this is working for folks, getting data publishing to count. I would love to know, now that you have tenure and you are on those committees. Well, really, it's on the people who are on those committees, and also on whether the people going up for review have enough guts to ask for it. I would like to encourage this, but I realize it's very hard to change systems that are already set. We have some colleagues who have said, I really fought hard for my digital contributions to be acknowledged and got totally shot down, and others who fought really hard and their committees actually changed. So it takes sticking your neck out and trying. But that's terrifying, right, when you're trying to get promoted?
Which is why I think, if the people who have been promoted and are now doing the reviews could start to think more about how we're going to value these things, and how to assess them, that would be amazing. That's the kind of thing we're working on now: how does community-based scholarship become legible, and what do you say when you're coming up for review? Those are documents that have been crafted by groups of scholars. Yes, so the Archaeological Institute of America is coming up with recommendations for this, and I think the American Historical Association came up with theirs. There are definitely guidelines out there, and they need to serve the people bringing their work forward, but also the people reviewing it, who may have no idea how to review this stuff. I wonder if that's part of it: people look at it and go, I don't know how to review this, so forget it. If reviewers had tools to actually understand how to assess the quality of these contributions, that would help. So yeah, we need more guidance. As part of this last project, we hope to come out with guidelines like that, for different people at different stages, to be able to assess digital contributions. But it seems like the best stepping-off point is places like the SAA, the AAA, and the AIA having recommendations for how you should deal with this, alongside the NSF and the NEH requiring that you do deal with it. That should be a starting point right there for faculty. The SAA should have a committee. Absolutely. And if that gets fed into the process, we're all speaking the same language; then it doesn't all hinge on having a grant, or on a grant being part of your tenure package. And it's a systematic way to do it.
Well, the people reviewing those grants also need tools for how to review these things. Data management plans, as far as I understand it, and the ones I've seen as a reviewer, are all over the place. How are you supposed to know how to assess those if you don't have any experience with them? It would be nice if there were examples. They're going to be diverse, because every project is really diverse, but there have to be some baselines: data management is not putting it in my file cabinet. I'm wondering, don't you presumably have metrics on how much visitation and use your site is getting? And is the granular nature of the Open Context approach showing more use than, say, sites that group everything into big files? So one issue is that we don't track user metrics, because we want to keep people anonymous; their use is anonymous, and we're not following users like that. We do go through Google Scholar citations: if you search for Open Context in Google Scholar, you can see how many places are actually citing things from Open Context. But we haven't looked at it in terms of individual items versus whole projects yet. That's a good idea, though, to see whether people are more interested in granular citation rather than citing projects as a whole. Well, that's the kind of information that can be fed to these new committees, if you know what's happening. Right. And I think maybe the digital data management interest group of the SAA probably needs to get organized and take a request to the SAA board, for example. They're always setting up task forces; a task force to assess the task force, one this week, one next week. So, requesting a task force to start looking at it.
Because I think where we are is a transitional period. I was just talking to somebody on the phone this morning about an evaluation of an archaeologist for a major deanship. Why would an archaeologist be a good person to have as a dean? Because we're multidisciplinary. And the people of the current generation, and even younger generations, have to be the ones who say the field is changing, the evaluation is changing, personnel cases are changing. You've got to get people in there for that, right? Yeah, definitely. I think it's a transitional place, and we're still sort of one foot in the old, while some people are getting one foot in the new. I think so too, and I think there's value to both the carrots and the sticks. The stick of the NSF saying you have to have data management has made people wake up and go, oh, I have to think about my data, which is why we've seen an uptick in people coming to us with actual funding. It's finally starting to happen: that requirement came into place, it took a few years, and now they're like, OK, I'm ready, and I've got $1,000 or whatever. But the carrot of having your metrics is a great incentive for putting your data up there, because look at all these people using it, and then you can report that. So we'll look into that. I think it's an interesting issue to bring up in a couple of weeks, when we have France Córdova, the head of the NSF, here to talk about broader impacts. Because this is a broader impact. Their notions of broader impacts and data management are not separate at all, so you can fulfill both of those, if people are smart about it. And then with things like the podcasts, you don't even know how broad that impact is. It's out there and it's being used, and it's great, but it's hard to measure, even then. Thank you for your presentation.
Some of the issues with collections are really about these preferences, even when people are recording the same kind of data, and I especially liked your comment on calibrating data-collection techniques. I was wondering, do you think there's something particular to zooarchaeology, since you focus on many species? In bioarchaeology, for example, would it also make sense to focus on calibrating techniques, say in how we profile a population in terms of biological sex, age, or pathology, and how do other fields of archaeology go about that? I think it's easier for zooarchaeology because we have a taxonomy, with Latin names for species, and everyone uses them. That's really easy. You don't have that in ceramics or other fields, lithics. So zooarchaeology is a good place to start, to see how we might do this. But I think, for instance, if you work in someone's lab and you learn their technique of analyzing lithics, you've calibrated with them, right? Because you've trained with them. Those approaches just aren't necessarily formalized. Maybe they've published papers and such, but there's no standard they've published that people can turn to. Part of that is that people hate standards sometimes, and no one wants to adhere to someone else's. But if we start to share those more formally, I think we will start to converge on good approaches. It takes opening up the hood and looking at all the mess, and that's really hard to share. It's hard to be that open with your colleagues and then see what emerges as best practice. Share your mess; open your underwear drawer. Sorry. I'm just curious whether there are any examples of archaeobotanical data published in Open Context.
So we don't have any there right now, but we have two in the pipeline that aren't ready yet, so they will be coming. But it's a hard one. Maybe correct me if I'm wrong, but the way you report your data is in more summary-type tables, right? That's how you have to do it, by sample or whatever, with the raw data published in an appendix. Okay, but what I mean is that you're doing it by percentage within a sample, so it's not individual item by individual item; that's a little bit of a difference. Well, you mentioned Claire Malleson from Giza, and I think she's going to enter the different species of plants she has. Yes, I just cleaned her data, and it was very clean, and she did that: she entered the different species. We've linked them all now to a species list on the web, and hopefully people can start linking across projects. I didn't get into it today, but the whole idea of linked open data is how we try to get data out of Open Context and find related data across the web. So yeah, her material will be coming really soon; it's basically ready to go. Great, because I would love to see examples before other people participate. You can email me and I can let you know, and you can have a look and give us feedback, because we always appreciate what people have to say about how the data looks and how we can improve it. Thank you. Did you have a question? Do you give people a standard for submitting their data? So yeah, only when they've given one to us. For that Chinese petrography database that we're just putting online now, he provided a template for how people should submit their data.
We would like people to share the ontologies behind their naming systems, so that when you publish in a place like Open Context it becomes more formal, and someone can say, I'm using so-and-so's ontology; here's what I mean when I say black-topped burnished ware, or whatever the terminology is. But no, that's coming; it's going to happen as people submit their data. We have one project by Kate Brunson, who's at Harvard, I think, looking at Chinese oracle bones. She has also published her system for zoning the bones, so a scapula, for example, has all these different zones, and that way she can say exactly where the markings, the burnings, and the drillings are happening on these oracle bones. It's in Chinese and English, so someone can talk about this kind of marking on zone D, and she's given an authoritative page that describes that zone and that type of marking. That's really cool, because it means she's helping to develop a standard that could be used by others. That's the way we're hoping this will happen, but we don't have a library of those things yet, unfortunately. Oh, yeah, okay; they're actively working with material culture analysis, but they've found it very difficult for people to follow standards and very difficult to get people into that space, so they catalog. Right, and that's DAACS's approach: you re-catalog. We actually have some collaborative work going on with them, and we're hoping we could pull in their approach, or point to it as one of the resources people could use. So that's another approach: integrating content that is related across an area you can draw a line around.
But it wouldn't work for all lithic specialists, say, across the world, because it's too diverse. Anyway, I'll stop there. Thanks for staying. Thanks.