Okay, I believe we are now live, but it takes a few more seconds for all the settings to filter through. So now we'll wait for a moment while we get all the attendees who are in the other session over to this session, and wait for that number to go up into the mid 40s. That looks like everybody is moving sessions now, so I think we can go ahead and get started. All right, onward and upward, the fourth talk of the day. Thank you all for being here. It's my distinct pleasure to introduce a really exciting talk from the team involved with the CAPTURE ERC project: Isto Huvila, Lisa Börjesson, Olle Sköld, and Zanna Friberg of Uppsala University. All of Uppsala University? Yes? Okay. And they're talking to us today on "Understanding masking in scholarly data publishing and reuse: an exploratory study of practices relating to obscurities in archaeological data." So please take it away. Thanks so much.

Thank you. I'll start by saying that we hope to do a presentation of around 30 minutes and follow that up with some Q&A. Before really launching into the presentation of this exploratory study, we would just like to present some context. The study we're going to talk to you about today is part of the work we're doing in the research project CAPTURE: Capturing Paradata for Documenting Data Creation and Use for the Research of the Future. This is a project funded by the European Research Council. You can switch slides. As you can probably tell from the title of our project, paradata is a central concept for us. As we understand it, paradata is data about the processes and tools involved in creating research data, so it's akin to other auxiliary data types like process, provenance, and context data. Our overarching research objective is to understand more about how to capture and document enough of these processes to make data truly reusable, and we're coming at this from an information science perspective.
We're looking mainly at cultural heritage data, and particularly at archaeological data. This data could be physical artifacts; it could be drawings, samples, measurements, and 3D models. So there's a wide range of different data sets, collected in several different ways by different methods, and this diversity really intensifies the need for explicit information about the provenance and the processes involved in the data collection, in order to facilitate reuse. You can switch slides again. But it's no easy matter, especially when we come to data publishing. There are no set ways of informing about provenance and process, of writing about paradata. The researcher is always faced with what to add and how much to add, and adding isn't always better; more information isn't always better. So maybe don't add at all, or reduce. These are all complexities, and the data are full of complexities, that every data publisher is continuously faced with, project by project. And the resulting adding of, or avoiding to add, paradata is a communicative act: a wish to make the work understood in a correct way. This could entail wanting to present the data in a way that feels perfect, that will ensure that it's not misunderstood or misused, or that you won't face critique for the way you present the data. An understanding of this act is key, both to understanding the origin stories of paradata and to working towards better support for paradata creation and dissemination, for the benefit of data usability. It is in this context that the exploratory study we'll talk to you about today fits into our project. With that, I hope you have sufficient background and context, and I will leave the next slide to Isto.

Thank you, Sanna. So the purpose of the analysis in this talk that we're offering today is to provide some insight into how researchers deal with data-related uncertainties in data publishing practices.
This purpose is met by an analysis of what provenance and process information researchers think should be presented with research data, what they identify as difficult in producing data provenance and process descriptions, and the strategies they use to work around these difficulties. And I think the connection between the conference theme and our talk emerges quite clearly. One of the products of science that has become increasingly digital is research data. There are presently many drivers that push STEM and SSH scholars to publish the research data they collect in repositories that are not only for local project-group use, but actually enable widespread, large-scale access and reuse of the data they have collected. Research data publishing is an aspect of the digitization of science and its products that is becoming more and more ubiquitous and more and more impactful, we would argue, but that is also to some extent understudied and underexplored. And because every data set has some degree of epistemological uncertainty, getting a better understanding of how researchers think and act when dealing with uncertainty in the data they are publishing in digital repositories is a key step towards realizing the potential of research data publishing to a larger degree. There are many potential aspects we could delve into in the topic we have outlined so far, but in this talk our particular focus is set on exploring how the twin concepts of masking and unmasking can be used to understand researchers' strategies for reducing or increasing the amount of supplementary information that they add to research data. And what are masking and unmasking in the context of scholarly data publishing, then? That's a good question.
Well, generally put, masking can refer to such things as lessening the degree of messiness, uncertainty, and guesswork in the data, decreasing the granularity of detail in the data, and black-boxing descriptions of methods and the use of methods. Unmasking would be the opposite: adding process and provenance data to a data set that can explain all the details of how the data came into being. Masking and unmasking of data can be done in many ways, for many reasons, and with a varying degree of deliberateness. But at its core they signify some kind of act that for some reason hides, obscures, or does not convey the complexity of published data, or, when it comes to unmasking, does the opposite, that is to say, highlights the complexity of the published data. We have, of course, not invented these concepts; they come from psychology originally, I think. But let's delve into two brief examples of how masking and unmasking have been explored in the literature. Here we have a quote from Turkle's book Evocative Objects: Things We Think With. Turkle uses this metaphor of front room and back room knowledge to explain the nature of scholarly knowledge production. What is claimed in this quote is that the back room is pretty messy, a chaotic space, while front room knowledge, which here would be the things that we put in papers and books and chapters and publish, is tidy. The argument is that front room knowledge is constructed, and that we make a clean story to mask our anxieties about the chaotic state of the literature. Another example, one that is closer to the archaeological data that we are exploring, is taken from a paper by Ulla, who talks about data sets that are old and that many researchers have been using and adding to over a long period of time.
So masking here would be, for example, to quote: a "subsequent reprojection of coordinates that were initially reported as rounded values would mask the fact that they were rounded originally." So with that, I give the mic to Lisa, who will talk a bit about our research design.

Yes, thank you so much. We are currently conducting this interview study, and our goal is to include 30 archaeologists covering as many subspecialties as possible, everything from GIS to music archaeology. The types of expertise present in the subset that we are analyzing for this pilot are classical Mediterranean archaeology, archaeological remote sensing, landscape archaeology, and archaeological data integration; these are the self-reported areas of expertise of the interviewees. So you see that we already have quite a range. These are all researchers who in some capacity work to produce, integrate, make available, or use research data as part of their research work, and they can also have other appointments, such as managing database projects or teaching database construction. We have analyzed these interviews in NVivo, using a ground-up strategy to identify what these interviewees experience as challenges in working with data and how they deal with these challenges. We're still in the process of grouping and naming our categories, so we're looking forward to your questions and your input on the analysis that we're presenting here. Next slide, please. Yes. So what do we mean by data complexities in this case? Well, we already see quite a variety of different data complexities that researchers deal with. Some of these relate to data content. For example, if you wish to integrate data generated over multiple decades, it might be that for the data from the mid-1950s you have the data, but you don't have any explicit methods description, so you don't really know what the observations represent.
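To make the coordinate-rounding form of masking quoted above concrete, here is a minimal sketch. This is our own illustration with made-up numbers, and the `reproject` function is a stand-in affine transform rather than a real coordinate-system conversion: the point is only that reprojecting previously rounded coordinates produces values with many significant digits, so nothing in the published numbers reveals the original rounding.

```python
# Illustration (hypothetical coordinates): rounding coordinates and then
# reprojecting them masks the fact that they were ever rounded.

def reproject(lat, lon):
    """Stand-in for a real CRS transformation (e.g. degrees to a metric grid).
    Any such transform will do for the illustration."""
    x = lon * 111_320   # very rough metres per degree of longitude
    y = lat * 110_574   # very rough metres per degree of latitude
    return x, y

precise = (59.858564, 17.638927)                        # original field measurement
rounded = (round(precise[0], 3), round(precise[1], 3))  # reported as 59.859, 17.639

x, y = reproject(*rounded)
# x and y now carry long decimal expansions, so a reader of the published
# data set has no visible cue that the inputs were only good to ~100 m.
print(x, y)
```

Unmasking, in this picture, would simply be recording alongside the data that the source coordinates were rounded to three decimals before reprojection.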
But you still want to use it, to integrate it into a fuller or larger data set to analyze. It can also be the state of the data that's complex or uncertain in some way. It can be that you know there are, in fact, a certain number of investigations undertaken in a certain region, but when you search you can only find that a selection of these has been reported. So you understand there's a backlog somewhere, but you don't know where it is or how to locate it. It can also be such things as we heard in the earlier talk today, questions about terminological uncertainty: there are terms that are perhaps broad descriptive terms, and you don't really know why they used such broad terms. Is it because there was a lack of time when they conducted the investigation? Was there not enough evidence to pinpoint a narrower find type or time period, et cetera? In the original title of this talk we used the term data obscurities, and along the way we have also used the term data uncertainties, but we have arrived at using complexities as a better way of describing all these different kinds of challenges that put the researcher in a situation where they have to pick a strategy for how to deal with it. And just to give one example of such a strategy, one that makes me smile at least: one of our interviewees had a strategy where she said that anyone actually working with research data knows that it's imperfect and incomplete. This is sort of a strategy of saying: this is how I think about research data, and I assume everybody else thinks about research data in this way as well. So it's a generalizing acceptance of complexity. But as we will see now as we proceed to the next slide, we have identified several different strategies for dealing with data complexities in our interview material.
Naturally, as we are information science researchers, we look at information, and we ask: do the strategies that researchers employ result in added information, or do they not result in any added information? In the background we're thinking about data descriptions and metadata; that's what we are oriented towards. When we look at the strategies that result in added information, we see on the one hand traditional strategies like increasing metadata, adding textual scope notes to metadata categories, or adding literature references in metadata, that is, expanding the traditional metadata scheme, and also turning to set formats such as data journals to explain more about the methodology. Then we also see some more free-form ways of dealing with data complexities, like signposting processuality or incompleteness, for example on the main page of a database project. The idea then is that the database user will first go to the project main page before they access the database, read there what the status of this database is, and then hopefully remember that when they start looking into the database and extracting data for their further use. It can also be such things as trying to explicate and verbalize tacit provenance and process knowledge in different forms and formats, and also versioning, creating libraries of versions of a data set, for example in GitHub or other repositories. On the next slide we have an example of one of these free-form strategies for explicating a process. This is an archaeologist working with multivariate spectral data to identify components in rock material. He's explaining to me that, well, there's a file on the pre-processing, and "I put in each step of what I do, and I have a more general description of what I do, and then I also save the scripts that I'm running, and I'm combining these together into one narrative."
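The "one narrative" approach in this quote could look something like the following sketch. Everything here, the `Step` structure, the field names, and the example scripts, is our own hypothetical illustration of the idea rather than the interviewee's actual setup: a prose description of each pre-processing step is interleaved with the script that carried it out, producing one combined paradata document.

```python
from dataclasses import dataclass

@dataclass
class Step:
    description: str  # the researcher's general description of the step
    script: str       # what the machine/equipment was told to do

def build_narrative(title: str, steps: list[Step]) -> str:
    """Interleave prose step descriptions with the scripts that ran,
    producing one combined narrative for an eventual data user."""
    parts = [f"Pre-processing narrative: {title}"]
    for i, step in enumerate(steps, start=1):
        parts.append(f"\nStep {i}: {step.description}")
        # indent the script so it reads as a quoted block in the narrative
        parts.extend("    " + line for line in step.script.strip().splitlines())
    return "\n".join(parts)

steps = [
    Step("Baseline correction of the raw spectra",
         "spectra = spectra - estimate_baseline(spectra)"),
    Step("Normalisation of each spectrum to unit area",
         "spectra = spectra / spectra.sum(axis=1, keepdims=True)"),
]
print(build_narrative("rock spectral data", steps))
```

The design point is the pairing itself: neither the prose nor the scripts alone tells a reuser what happened, but kept side by side and versioned with the data set, they do.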
So what he basically is doing is telling what he is doing, and telling what he is telling his machine, his equipment, to do, and then combining this into one narrative that he hopes will inform an eventual data user. Next slide, please. Then we have, on the other hand, the different strategies for dealing with data complexities that do not actually add any information. One such strategy was, as I mentioned earlier, simply accepting complexity. Another is relying on someone else's knowledge: on a data scientist's knowledge, or on re-user expertise, thinking or arguing, I don't need to explain all this, because I know that whoever uses this data will understand. There are also different practical strategies, like changing or limiting focus to a less complex data source, postponing data description tasks, or reducing or avoiding specificity in data descriptions. We also see that there is sometimes a hesitance to present data, and a tendency sometimes to set unattainable data quality thresholds for publishing: I will publish this, but not until it is 100% quality checked; that's what I'm going for. Next slide. Here is one illustration of what a strategy that doesn't add information can look like. It's a database where they have worked meticulously to create one of these 100% perfect databases, but they have overlooked referencing where they got the numbers from. Did they get them from lab reports, or from second-level sources like reports? So they have overlooked a certain type of specificity. Next slide. The strategies that don't add any information could, as we mentioned earlier, just be interpreted as sloppy research practice, as not being explicit enough about what you're doing.
But when we interpret all these strategies as communicative acts and put them into this masking and unmasking grid, we see that there are actually strategies that don't add any information to the research data but that can still be understood as communicative acts, in the sense that they are reducing complexity to communicate the data in a more effective way. We also note that what counts as masking or unmasking in a certain situation would probably need to be interpreted in its context, because an act such as explaining a particular process or method in a data journal, for example, if done for the first time, as an innovative, even revolutionary first-time description, can be interpreted as unmasking, while if it's done by a copy-paste method, this is always how this method is described, we're doing this over and over again as a disciplinary standard, it's likely to be more of a masking strategy. Next slide. What we also see when we interpret these strategies in the masking and unmasking typology grid is that certain strategies of data avoidance fall outside of the grid and are actually strategies that steer away from communicating about data at all. An example of this is when you're changing or limiting your focus to a less complex data source instead of diving into and verbalizing what's complex about the data source that you were looking at in the first place. This is a really interesting case, we think, these instances of data avoidance, and we'll pick up on that later on. Olle?

Thank you, Lisa. Excellent stuff. So, just a brief reflection on our approach. We're trying to say something about how people deal with data uncertainties in the context of data publishing and reuse using masking and unmasking, because we found that this dichotomy is pretty useful for interpreting a number of strategies for dealing with data complexities.
These result both in added information and, in some cases, in no added information. One limitation of this approach, as Lisa said, is that the dichotomy doesn't actually cover the strategies of data avoidance, and it's likely important, going forward, to try to understand those strategies further. So basically, the analysis shows that data avoidance is something that is likely worthwhile to pursue in further studies. Important questions to ask can be along the lines of: what happens when data is avoided in publishing or reuse scenarios? What are the consequences of avoiding data in this context, and what are the reasons underpinning its occurrence and motivating people to avoid data in this sense? I'll move ahead a bit now with reference to the time. There are, of course, many interesting things to pick out and put the spotlight on in this presentation, or at least we think so, so I'm going to try to delve into a few of them that seem most interesting in this particular setting. One thing we could talk a bit about is the social productivity of masking and unmasking in the context of science. Turkle, whom we referred to previously in this presentation, writes that all published or finished knowledge products are basically, to some degree, clean, and that their certainties are constructed, or at least that their uncertainties are, to some degree, hidden. I think this leads us to think about to what extent masking and unmasking strategies are integral parts of commonplace ways of doing science, of creating data, analyzing it, and publishing it. For example, can the lessening of the complexity of a data set, which would be masking, be considered some kind of requirement for any type of epistemological work where results are obtained and reported?
Something that would also be interesting to think about is to what degree masking is actually a determinant and a key component of effective publication and reuse strategies for research data. And it would be interesting to think about where the threshold is at which productive uses of masking instead become illicit ones, that is to say, where you strive to hide the limitations or weaknesses of a data set. Something we have arrived at here is that when thinking about masking and unmasking going forward, a potentially useful interpretive axis is to think about it as an area affected by people's illicit purposes, that is to say hiding uncertainties or weaknesses in a data set, and by productive purposes and uses of masking, which would be rendering complexities workable and communicable when reporting on the data set. These two points would then intersect with technical or infrastructural requirements of data publishing, for instance certain data repositories that use preset metadata schemas that cannot be adapted to the specificities of the data set being published. Another thing that is important to consider here is, of course, disciplinary and domain perspectives, because science is done in very different ways in different disciplines, and there are surely different ways of dealing with uncertainty by masking or unmasking in different disciplines across the SSH and STEM spectrums. Moving ahead to the conclusions, we would like to say that it might be somewhat straightforward to identify how researchers deal with data-related uncertainties, but it's much more difficult to figure out the role of such activities in the larger realm of digital science.
Masking is done with many different underpinning motivations, and unmasking as well, and it's done by using many different types of approaches and mechanisms. So one question we could ask is to what extent it is feasible to use a single theoretical dichotomy such as masking and unmasking to explain how people deal with data-related uncertainties. What we found when thinking about that is that these concepts, this dichotomy of masking and unmasking, are useful for understanding data descriptions, how they fulfill different types of communicative intentions, and that the dichotomy is potentially useful for explaining how data makers and data publishers respond to and interact with regulations, guidelines, and, for instance, schemes for data descriptions. It is also important, when designing such schemes and data sharing systems, to take into account how to design them for dealing with data complexities and uncertainties and with these strategies of masking and unmasking that we know are pretty prevalent in different ways of doing science. And with that, I'm going to leave the closing to Sanna. Please go ahead.

Yeah, so the next steps for us within the project are many, of course, but for this particular exploratory study we're working towards completing the interview study and, as both Olle and Lisa touched upon, delving a bit deeper into strategies for dealing with data complexities beyond masking and unmasking, looking at ways of, and developing analytical tools for, analyzing such things as data avoidance. I would also just like to say that, additionally, we are conducting a survey on making and using archaeological data. So if anyone in the audience happens to be an archaeologist, or knows one, we would be very thankful if you would have a look at the survey, maybe complete it, or just tell someone about it. We've included a link, but it can also be found on our web page.
And with that, we can take the next slide, where I will extend our combined thank-you to everyone in the audience for listening to our talk. As you have probably noticed, this is exploratory work, and we would welcome any input or ideas that you have. We're also happy to answer any questions about both this study and our project. So thank you.

Wonderful, thank you so much. Questions are pouring in, so let me start with one from Stefan Hesbury, who asks: there's an ongoing discussion on using things like Python notebooks for reproducible research, and I wonder to what extent that kind of conversation gets picked up in your work, leaning into the more digital realm.

Well, again, I will stress that we may have a different answer when we have the full data set. But archaeology is a discipline where quite a lot of researchers have themselves developed programming skills, and they are more keen on using those types of technologies for their documentation. But there's also a really strong documentation tradition coming from within the archaeology discipline that will not easily give room for new technologies. They will not just hand over the responsibility, so to say, to new technologies; it has to go together with the established traditions in the discipline. And this also has to do with the fact that archaeology in many countries is quite strictly regulated and bound by the need to report findings and results to government authorities in different forms. So, would anyone want to add anything?

Perhaps just a brief interjection about the realm of archaeology that focuses on creating 3D renderings and models of sites and different types of archaeological environments.
There has been a pretty intense discussion about how to capture the making of these 3D renderings, so that people can actually, to some degree, reproduce the interpretations and the techniques that were applied in creating the final product. So that might be an example that mirrors the one mentioned in the question to some extent.

Great, the next upvoted question is mine, so I'll take the opportunity to self-indulge a little. Have you been able to get an idea of which kinds of publics your various interviewees have in mind when they think about reuse? I know this is, for archaeology especially, a much more multifaceted question than it probably seems at first to those of us who are less familiar with archaeology, and it might be a real driver of strategy choice. So what kinds of reuse scenarios do the various people have in mind here, do you think?

Well, as we mentioned, archaeology is a very, very varied discipline. So there are examples where the reuse imagined is other researchers wanting to do studies on the same region, for example somewhere in the Mediterranean area, so a regional interest. But there may also be, since archaeologists integrate their interests with so many other disciplines, for example, archaeologists focusing on volcanic ash data, so they are directing their work towards the geophysical discipline, and again different topical publics, so to speak. So there's a real range. Or certain materials, like ceramics; the ceramics community is a huge thing. Oh yes, please, sorry, go ahead.

Very briefly, in the interviews that I've done, the only time that the publics considered when publishing data were something other than archaeologists within basically the same area of archaeology was in the public finds scenario.
These are repositories where, if you find a sword in a lake, you send it to the public finds office and they document it, and they are geared not only towards other archaeologists, but rather towards the people inhabiting that part of the country in a more general way.

That's really interesting, that's cool. OK, next up, two related questions coming in. A question from Beckett Sterner, who asks: do your interviewees talk about the value of masking and unmasking in connection with data or modeling standards? He says he's thinking about Millerand's work on making private troubles into public issues as part of the Ecological Metadata Language. So how do data or modeling standards interact here? Would you like to start?

Yeah, I can start. I think this is a very good question, and it's very complex, and I think the people that we talk to consider it to be very complex as well. I would say that my impression is that standards are understood to be something that is very, very useful in a general sense, but perhaps less useful when it comes to the particular data set that people are actually talking about. And there is definitely a discrepancy between what people are willing to do in terms of curating their data set to fit the standard and what is actually done. This connects to the example where the idea of a perfect data publishing approach actually gets in the way of actually publishing the data. So that's how we can look at it. Lisa, do you have anything to tie onto that?

Well, it's a little difficult with the interaction here and clarifying the question, but something that I could add is that sometimes archaeologists get help from other people doing the data modeling work. So sometimes it's black-boxed in terms of the division of responsibility between different persons.
The data scientist knows what we did with the data, and I know what we know based on the data. So that's also how standards play into how things are done with data in this type of work.

That's actually a nice segue into the last question that I have here, from Sarah Davies, who says: super interesting presentation. I wonder if there are parallels with data practices in other disciplines, for instance in bioinformatics or biocuration. So, have you looked into comparisons with other fields, and how do you see that working in your project? Would you like to take that one, Isto, as the project lead?

Yeah, sure, I can obviously do that. Our focus in this project is on archaeology, and that's sort of an empirical lens through which to look at data reuse and data documentation practices. What we're going to do in the future is a few excursions to other fields as well. At the same time, we chose, or I chose, archaeology for the reason that it's an interesting field in the sense that it's so interdisciplinary, so we can get a part of those other fields as well: we're working with people who are interacting with science data and with very textual humanities data and so on, while comparing to what we know about other fields. So yes, there are definitely many similarities, although there are also many differences. What is key to many of them is precisely the interdisciplinarity of the archaeological research enterprise, in the sense that with a certain kind of data you are part of a specific community, but then when you are using the data in an archaeological context, you have to take into account other archaeological communities. So it's a kind of balancing act between different communities, and that's quite apparently something that pertains to other fields as well.
But there are similarities, and people deal with data in similar ways. And then, at the same time, there are differences depending on the context and situation.

Fantastic. Great. That's all our questions, and we're right on time, so thank you.