Today we're going to be speaking about provenance and social science data. You should be able to see on our screen that we're showing our data provenance community page; we have a data provenance interest group, and if you're interested in that you can contact us through the contacts on that page. We have our speakers here. I'm Kate LeMay, I'm from ANDS, and I'm one of the research data specialists at ANDS, and we have George Alter, Steve McEachern and Nicholas Car. We'll give each of them a little bit of an intro when we get to their point in speaking. As I mentioned, this is part of a series and today's our first one, so I'd like to introduce Steve and Nick, who will be speaking first.

Steve is the director of the Australian Data Archive at the Australian National University. He holds a PhD in industrial relations and a graduate diploma in management information systems, and has research interests in data management and archiving, community and social attitude surveys, new data collection methods, and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years and is currently chair of the executive board of the Data Documentation Initiative.

Nicholas Car is the data architect for Geoscience Australia (GA). In that role he designs and helps build enterprise data platforms. GA is particularly interested in the transparency and repeatability of its science and the data products it delivers. For these reasons, Nick implements provenance modelling and management systems in order to represent and store information about data lineage: what was done, who did it, and what they used to do it. Previous to working at GA, Nick was an experimental scientist at CSIRO, where he researched metadata systems, provenance, data management and linked data. He currently co-chairs the Research Data Alliance's Research Data Provenance Interest Group, which the ANDS Provenance Interest Group works with, and through that and other groups he assists organisations with provenance management adoption.

Okay, thanks Kate. All right, so this is a very quick introduction to PROV. PROV is a provenance standard, and what you see on that first slide is a very, very simple diagram of a little provenance network, and I'll discuss some of that as we go; it's not just a frivolous diagram, it actually has some meaning. Okay, so the outline for today: what is PROV? I'm just going to mention that quickly, and then I'm going to get to how you actually use this thing in a couple of different ways. First I'll talk about modelling, then I'll talk about how you actually manage the data once you've collected or made provenance data, and then I'll talk about using PROV with other systems.

So what is PROV? PROV is a W3C recommendation. The W3C is the World Wide Web Consortium, one of the governing bodies of internet standards, and they don't issue any documents called standards; they issue documents called recommendations. So PROV is a recommendation — their top level of standard, I suppose. Other standards by the W3C are things like HTML, so I'm sure everyone's familiar with their work at least to some extent. PROV itself was completed in 2013 and formalised by the end of that year, so it's only a couple of years old, and a large number of authors were involved in PROV.
There were several initiatives to make provenance standards before PROV, over the last perhaps 20 years, such as PML and OPM. I'm not going to elaborate on all of those; if you're interested, just Google them. Many of the authors involved in those initiatives were involved with PROV, so PROV really does know about those other initiatives, and it's simpler than those precursors because it's trying to be a high-level standard. It doesn't do as many of the tasks that those precursors do, but it certainly represents the very important bits that they came up with. Another thing to say about PROV is that there's no version 2 planned any time soon. Why am I bringing this up now? Well, it's a pain for people to have to deal with versions 2 and 3 and 4 of standards. PROV doesn't quite operate like that, and I'll explain how: it is what it is, and there are ways to extend it and use it in different circumstances, but it's unlikely we're going to see any version change in the next few years, I would think. And it's seen good adoption. PROV is really the only international, broad-scale provenance standard, and as a result people are, I think, happy to adopt it over really anything else.

PROV is actually a collection of documents, and I've listed them there. I'm not going to go through them all in great detail, but there is an overview document, then certain bits and pieces which are actual recommendations — standards — and additional things that just help you use PROV. The main document is PROV-DM, the data model, which tells you what PROV contains, how its classes operate and so on; then there's a series of documents like an XML version of PROV, an ontology version, special notations and so on. The only other one I'll mention is PROV constraints, which is a list of rules that PROV-compliant chunks of data must adhere to, and that works across any formulation of PROV. I've provided a link there to the collection of documents.

All right, so how do I use PROV? This is modelling: how do I actually model something using PROV to do the core of provenance representation? Well, I'm starting with some negatives, so don't do it like this. Don't take a document for something — perhaps a metadata catalogue entry — and expect to shove a bunch of information into some field within that document. ISO 19115 is a standard for spatial datasets and it's got a field called lineage, and some people expect to take PROV provenance information and stick it in that lineage field. Don't do that; PROV doesn't want you to do that, and I'll explain why in a second. So that's one thing not to do: we're not going to see a single item's metadata record containing a bunch of provenance information. You could do that, but it's not recommended. What else should I not do? This diagram here is the class model of DCAT, the data catalogue vocabulary, which is a very generic metadata model; it's used in relation to things like Dublin Core and various catalogue-style things. We're not going to link a dataset, or any other object in DCAT or Dublin Core or other standards like that, to a class of provenance information, and this is true for Steve's DDI as well: we're not going to take objects in DDI and link them to a provenance object that tells you the provenance of that object. That's an anti-pattern right there. Okay, so what are we going to do?
Oh, and we don't even do this using Dublin Core's provenance property. The Dublin Core vocabulary has a property called provenance, and the wording for it says to use it to describe lineage and history; PROV doesn't want you to do it exactly like that. What does PROV want? It wants you to think of everything that you're interested in in terms of three general classes of objects. In your scenario, the things that you're interested in: are they things, which PROV calls entities; are they occurrences or processes, which PROV calls activities; or are they causative people or organisations, which PROV calls agents? PROV says model everything you know about using those three classes and then link them together, and that's what PROV is all about.

So how does GA use PROV? We often process chunks of data at GA, so we have a very simple model using the provenance ontology, and it looks like this: there's some process; the process generates outputs; the outputs are entities; the process itself is an activity; and then there's data and code and configuration and so on that feed into that process, and those are also entities. Finally, the process and the entities might be related to a system, and even to a person who operates that system. So that's the model we use.

Okay, so how do I actually manage the data that I collect or create according to PROV? Well, you can create reports: if you go and do something, a human or a system could log what they've done and store that information in some kind of database according to the PROV model — a document database — and then you can query that thing. We often have systems that send reports every time they run, and you might have a form that looks like any other metadata entry form, where you fill in details and hit enter, and that sends off provenance information. But again, it's not storing it with respect to one specific object; it's linking existing objects together, so for a dataset that is produced from another dataset, it's going to link those two things together. For catalogues we can link things again: if we have a catalogue that has a dataset X and a dataset Y and we want to show there's a link, we can say dataset Y was derived from dataset X and record that information somewhere. Now, dataset Y may record "I come from dataset X", but that's just a very simple little bit of provenance information; it's not a whole blob of provenance information stored within dataset Y. And we can ensure that any system that holds information that is provenance information — like who the creator of a dataset was — does so in accordance with the PROV model.
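To make that pattern concrete, here is a minimal sketch of the kind of entity–activity–agent network just described, written with the third-party Python `prov` package (assuming `pip install prov`); the identifiers are hypothetical examples, not real GA records.

```python
# A minimal, illustrative PROV graph: one processing run (Activity) that
# uses input data and code (Entities), generates an output (Entity), and
# is associated with an operator (Agent). All identifiers are hypothetical.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

source = doc.entity('ex:source-dataset')      # a thing
code = doc.entity('ex:processing-code')       # also a thing
output = doc.entity('ex:derived-dataset')     # the thing produced
run = doc.activity('ex:processing-run-42')    # a temporal occurrence
operator = doc.agent('ex:data-analyst')       # a causative party

# The provenance is this network of links, not a blob in a "lineage" field.
doc.used(run, source)
doc.used(run, code)
doc.wasGeneratedBy(output, run)
doc.wasAssociatedWith(run, operator)
doc.wasDerivedFrom(output, source)

print(doc.get_provn())   # serialise the little graph in PROV-N notation
```

The point of the sketch is simply that each object is classified as an entity, activity or agent and then linked; a processing log, a metadata entry form or a catalogue record can all emit statements of this shape.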
So in this case, if we had a dataset that had a creator, we would say the dataset was associated with an agent, and the agent had a role to play, and that role in this case was creator; that's now a PROV expression of that relationship. For databases it can be very difficult — I can't explain it in depth here — but there are many ways in which databases could store PROV-related provenance information. They would, however, need to be able to show that they can export their provenance content according to the PROV data model; you actually have to prove that if you want to say that you're compliant with the standard.

Okay, so fairly quickly: how do I get PROV to work with other systems? Well, we can fully align our system, whatever that system is. I've used a theoretical example of metadata system X. How do I align metadata system X with PROV? I could classify all the things in metadata system X according to PROV. That requires a data model for metadata system X — a data model, not just encoding formats; we can't just deal with XML and so on, we actually have to have a conceptual model — and then we can say this class of thing in metadata system X is the same as this class of thing in PROV. Now, PROV's only got a few classes, so that's usually pretty easy to do, but it will definitely prompt you to do things that you wouldn't normally do: you may have to tease apart some of the objects that you know and love into things that PROV recognises as different objects. You could do a partial alignment: you could take your metadata system X and only acknowledge that some of the things in that scenario are PROV-understood things. Maybe you've got a metadata model that talks about all kinds of stuff, and one of the things it talks about is a dataset, and you say your dataset is the same as what PROV thinks of as an entity, and maybe you ignore all the other things. You would still need to demonstrate that you could extract valid PROV out of that, and not all the other stuff, but that would be one way to do it. You could also link to things not in your own data model, if you also classify those things according to PROV. The last scenario you could think about is to just deprecate your obviously-not-as-good systems and use PROV, and that would perhaps require you to make either a new dataset of PROV information or a data store, and put that information somewhere. That's it.

Thank you very much, Nick; we'll move on to Steve. Nick's talked about the general PROV model that is increasingly getting used in various different spaces. I'm going to talk specifically about various ways of thinking about provenance in what we're doing in the social sciences, and particularly about using it with the standard that we utilise through the Data Documentation Initiative, which I now chair. Part of the reason we've connected these two together is that we're now looking at how we can leverage the PROV standard inside DDI; in point of fact, Nick and I and a group of others have been working on how we might go about this. I'm not going to touch too much on that, but I'll return to it at the end.
I want to talk more generally about how we might think about provenance at different stages in the data lifecycle — or different stages in the researcher and data management experience — and how our thinking about provenance has progressed over that time, just to give a sense of what sorts of things we can do already and how we can increasingly embed the capture of provenance in what we do.

Okay, quickly, for those who don't know the Australian Data Archive — we've had various names over time — a quick introduction. We've been around for a little while now, based at the Research School of Social Sciences at ANU, and our mission is to collect and preserve Australian social science data for the social science research community in Australia and internationally. We've developed a collection of over 5,000 datasets now, across over 1,500 different studies, as we call them, or projects, from lots of different sources — lots of different provenance — and from various different locations: academic, government and the private sector. As our holdings have developed, our understanding of provenance has developed alongside them. Maybe we didn't call it that at the time, but over 35 years I think it has always underpinned a lot of what we've done: helping researchers, who might be the secondary users of our data, to know where did this come from, what was it used for, and how might I use it in the future. That's really the emphasis there.

For those who don't really know what we're talking about when I use the term data archive, we're using a definition of trusted services out of a project done by the Social Sciences and Humanities Research Council of Canada — roughly Canada's equivalent of the ARC: an accessible and comprehensive service empowering researchers to locate, request, retrieve and use data resources, so you're able to find and understand data in a simple, seamless and cost-effective way, while at the same time protecting the privacy, confidentiality and intellectual property rights of those involved. Part of why we're interested in provenance is really that last point: one reason is to help researchers understand where data came from, but the other is to recognise and acknowledge the intellectual property that's been developed in those resources over time.

Okay, so I'm going to give a brief introduction to the DDI standard and its different flavours — as Nick pointed out, having multiple versions is not always much fun; we're up to version four and we're about 20 years old now, so I think that's not too bad — and to how we've captured what we might think of as different forms of provenance over time. I've got the website there, the ddialliance.org website; if you're interested in knowing more, you can go and explore the different versions of the standard there. So what is DDI?
I'd say it's a structured metadata specification developed for the community and by the community — particularly the social science data archives that exist in most OECD countries. It's used in about 90 different countries around the world now, thanks to work by the World Bank, the World Health Organization and others. There are two major development lines that are basically XML schemas: one is DDI Codebook and the other DDI Lifecycle, which roughly correspond to version two and version three of the standard, and I'll talk a little bit more about those in a moment. We have some other elements as well: additional specifications, including some controlled vocabularies — often for things like encoding methodology, data types and data capture processes — and some RDF vocabularies, so that we can start moving into a linked data world; you can leverage the Lifecycle standard in a linked data environment. Version four is in development at the moment and has been for the last couple of years, and that's where the work with Nick has come on board as well. It's moving to a model-based specification: rather than being based in a particular schema, we're looking to focus on the model and then on its expression in various different formats — the provisional ones at this point are XML and RDF — and that includes support for provenance and process models. So we're looking at how we leverage what we know from PROV to support the provenance model within the new version of the standard. And it's managed by the DDI Alliance.

So, briefly, on the two versions of the standard that are already in place. DDI has been around in the Codebook format, which has its origins in the printed codebooks produced by organisations like George's going right back to the 1960s and 70s — so we formalised, in the social sciences, a fairly structured way of thinking about describing data about 40 years ago, really. The Codebook version of the standard is an after-the-fact description of what a dataset is about. It includes four basic sections. The document description describes the document that is describing the dataset. The study description — we use the term study to describe the package of datasets that encapsulates a project — covers characteristics of the study itself that the DDI is describing; that includes lots of sections on authorship, citation and access conditions, but, particularly from the point of view of provenance, the methodological content, data collection processes and sources. We also include a lot of what we call related materials: documents associated with the project that tell you something about the provenance of where it came from, including all the questionnaires, previous codebooks, technical reports and so on. So from a human point of view you're starting to get into the area of thinking about provenance, even though it's not really a machine-actionable version of that. We also describe the files themselves — the characteristics of the physical data files, data formats, size and structure — and then what we call variable descriptions: descriptions of the variables that are included in the data file. The simplest way of thinking about this is the columns of a tabular dataset: what does that column mean? Because in a lot of the social sciences a number is not actually a number; it represents a characteristic of some sort, for example a five-point agree/disagree scale in a survey.
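As a concrete illustration of what a variable description carries, here is a small hypothetical sketch in Python (not actual DDI XML); the variable name, question text, labels and codes are invented for the example.

```python
# Illustrative variable-level metadata of the kind a DDI Codebook entry
# records: without it, the numeric codes in the data column are opaque.
variable = {
    "name": "q07_trust",                         # column name in the data file
    "label": "Trust in the federal government",  # human-readable label
    "question_text": "How much do you agree that the federal government "
                     "can be trusted to do the right thing?",
    "value_labels": {
        1: "Strongly agree",
        2: "Agree",
        3: "Neither agree nor disagree",
        4: "Disagree",
        5: "Strongly disagree",
        -1: "Missing / not answered",
    },
}

def decode(code: int) -> str:
    """Translate a stored numeric code into its substantive meaning."""
    return variable["value_labels"].get(code, "Undocumented code")

print(decode(2))    # -> "Agree"
print(decode(-1))   # -> "Missing / not answered"
```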
How you interpret that becomes important, and George is going to talk in a moment about a specific project looking at how we do a lot more with the variable descriptions, the properties of variables.

So the Codebook was developed to describe things after the fact. The DDI Lifecycle model takes a more data-lifecycle approach to thinking about capturing metadata and provenance, and underlying it is the model we have on screen here. This is a working model describing the different processes, in the DDI frame, that a dataset can go through: everything from conceptualising the study in the first place, through collection, processing and distribution, and as a side point archiving that data and storing it away for future use, and then rediscovery, analysis and repurposing into the future. It was built with the intent of reusability, and particularly machine-actionability, so that the metadata developed for a dataset can be reused in the future for the same purpose, a similar purpose, or something entirely new. In order to do that you need to be able to understand where things came from, so embedded in generating metadata going forward is being able to look backwards through the lifecycle as well. As I said, it is focused on metadata reuse, and that reuse of metadata really implies a provenance expectation. So why Lifecycle, and what can you do with it? It's machine actionable; it's more complex — there are 27 different schemas, and it's probably overly complex, if we're being fair; it's structured and identifiable, so every metadata item is actually permanently identified, managed and repurposed if that's required; it supports related standards; and it supports reuse across different projects, and again that's something George will touch on as well. I'll move past this, because there are some particular features I can refer back to later, but I want to talk very briefly about how we think about provenance within the different versions and then pass over to George to talk specifically about one of the projects.

So, if we think about how provenance has been supported: Nick's approach to the problem is really a machine-actionable model, fundamentally. DDI Codebook is not really designed for that, but it is designed at least to describe, to a human reading a catalogue entry, what the provenance of a dataset was. It includes attribution, methodology, data processing, collection and all the documentation we can find on what happened to the data, but it doesn't really do that in an automated way; it's really aimed at a human response, at a researcher being able to come back and have a look. Similarly with variables: the question text, the variable name and what the value labels mean are all there. DDI Lifecycle was really our first attempt to look at machine-actionable provenance — can we capture this along the way? It represents, again, the information about the study's attribution, methodology and so forth, but particularly with variables it's trying to look at the reusable elements: how we might reuse questions, reuse columns of data, and understand and reuse the basic conceptual ideas that are embedded within them.
So, for example, if you've got a variable measuring employment: can I reuse that employment measure — maybe the categorisation that was used, the numbers that were used in the survey — and so on. And where we're going with DDI 4 — the tagline for what we're calling it is DDI Views — is: to what extent can I actually embed a provenance model inside that framework? So now we're moving towards really recognising the importance of provenance, both conceptually and in the physical and digital formats of data; managing codes and categories across the lifecycle, for example; and managing provenance through missing values — if a value in the data changes, how do I understand that? — so that we're able to generate automatically what happened, at the level of an individual datum, of a variable, or of a dataset. So we're moving progressively towards the sort of framework that Nick described, but that requires the management of the metadata that we hold to be moved forward as well. That's it from me.

Hi everyone, thanks very much to ANDS and to ADA for inviting me to be here. What I'm going to talk about today is a project that started in October, with funding from the US National Science Foundation, about capturing metadata during the process of data creation. I don't think for this audience I have to justify metadata, but the big problem we face is how we actually get the metadata; that's often the difficult part — it's a lot easier to describe it than it is to actually get it, most of the time. To give you some background, I'm going to put this in the context of my home institution, which is the Inter-university Consortium for Political and Social Research (ICPSR), located at the University of Michigan. We've been in the business of archiving social science data since 1962, and we're an international consortium of more than 760 institutions. We were also one of the founding members of the Data Documentation Initiative Alliance, which Steve just talked about, and we actually provide the home office for the DDI Alliance. ICPSR has been using DDI for many years, but we're now getting to the point where we're able to build new kinds of tools that take advantage of DDI. One of the first things — which we've been doing for at least 10 years — is that when you download data from ICPSR you get with it a codebook in PDF, but the PDF is actually created from the DDI, not the other way around; so for us the DDI is the native version of the metadata. What we've started to do is take advantage of DDI to build new kinds of tools. One of the first ones we created was what's called our variable search page, where you can put in a search term and look for questions that have been used in datasets that are like that search term. This is an example of the results that come out of a variable search; we are now searching over more than four and a half million variables in about 5,000 studies, or data collections. One of the things that DDI makes possible is that we can go from this search to other characteristics of the data. You can see here, in blue, that there are a number of things that are hyperlinked; if you click on the place I've got circled, it takes you to an online codebook, and the online codebook has a number of features: it tells you the question that was asked, it tells you how it was coded, and if the data are available online you can go to a cross-tab tool, and
it can also link to an online graphing tool. The other thing that you see, on the left side of the screen, is a list of the other variables in the dataset, so you can move around in the dataset, and clicking on any of those variables will bring up a display similar to this. Another thing you can do from our variable search screen is that if you click on these check boxes on the left you can pick out a certain number of variables that you want to look at more closely, and clicking on the compare button at the top brings you to this screen, which is a side-by-side comparison of those different variables, which come from different studies. So you can see whether they're asking the same question and whether they're coded the same or differently, and as before, this screen is also hyperlinked to the online codebook, so you can go back and forth. One of our more recent tools, which I think is one of the most powerful, is that you can now search for datasets that include more than one variable that you're interested in. This is a search using what we call our variable relevance search — that's actually in the study search rather than the variable search — where we're looking for three variables about three different things: does the respondent read newspapers, do they volunteer in schools, and what is their race? You can see here that the results come out in three different columns within each study, so you can see which variables are present in each study, and as before, everything is hyperlinked to both the online codebook and the variable comparisons, so you can check on any combination of these variables and compare them side by side.

Another thing that we did, in a previous NSF project working with the American National Election Study and the General Social Survey, was to make a crosswalk of the variables that are available in those two studies. Now, the American National Election Study started in 1948 and is done every four years; the General Social Survey started in 1972 and is done every two years; so we're actually looking at over 70 different datasets. What we've done is create this crosswalk, where we've grouped the variables according to certain tags — we've got eight lists of tags, and 134 tags in total. Each column here represents a dataset, and there are 70 datasets. All of the variables are linked here, and although I can't show it here, if you hover over one of those variables it shows you the question text for the variable, and again you can use the checkboxes to pick out things that you want to compare and go to the variable comparison screen. A crosswalk like this is a tool that's actually very common; you've probably seen these before. There are two things that are different about this one, though. One is that it is all keyed into the online codebook, so you can go transparently back and forth. The other is that we can use this tool to crosswalk any of the four and a half million variables in the ICPSR collection, because it draws directly from our store of DDI metadata; we don't have to build a separate tool for each study — this one tool works over all of these datasets. Another thing that we did in this project was to think about how we could extend the online codebook. So here's our online codebook that you saw before, which has the question text and how it was coded, but this version has something new in this location here: it shows how you got to this question.
In big surveys, every respondent doesn't answer every question; there are what are often called skip patterns. You get asked what your marital status is, and if you're single you go to one question, if you're married you go to another question, and divorced people go to a third path. So there are different pathways through the questionnaire, and what we've done here is try to show how you got to that question, which explains why some people didn't answer it. We also represented it in words down here. We built this, and we were quite proud of ourselves for building it, because it does answer the question about who answered this question in the survey. But then we ran into a problem: how do we know who answered the question in the survey? The answer is that we get that information from the data providers in a PDF, and the only way we could build this demo prototype was to have one of our staff members enter this program flow information manually into XML for one of the datasets, so we could show how this works. So we showed a tool that we think is really useful, but we reached a roadblock, because we don't actually get machine-actionable metadata about this kind of information. The problem is that when the data arrive at the archive they don't have the question text — that's something that we at ICPSR and ADA have to type in — they don't have the interview flow, they don't have any information about variable provenance, and variables that are created out of other variables are not documented.

So the project we're working on now, which we call C2Metadata, for Continuous Capture of Metadata, is about how you get that, and to understand how we get it you have to think about how the data are created and what happens. First of all, the data themselves are actually born digital. People do not go around with a paper questionnaire these days; interviews use computer-assisted interviewing programs, over the telephone or with an interviewer carrying a laptop or a tablet. There's no paper questionnaire; there is instead a program, and it's the program that's the metadata. So technically, at the beginning, you start with this computer-assisted interviewing (CAI) system, and what you get out of it is the original dataset, but you can also derive from it DDI metadata in XML — there are a couple of different programs that will take these CAI systems, the code that they run on, and turn them into XML. But what happens next? Well, what happens next is that the project that commissioned the data is going to modify the data. There are a number of reasons for doing that: some things in the data are purely there for administrative purposes; some variables have to be changed to reduce the identifiability of individuals; some variables need to be combined into scales or indexes. So they write a script that's going to run in one of the major statistical packages, and the script and the data go through that software, and what comes out is a new dataset. Well, what happens to the metadata? At this point the metadata don't match the dataset anymore, and you would need to update the XML to fix that — and nobody likes updating XML — so the metadata get trashed and thrown away. What happens then is this: after the data are revised, the metadata are recreated.
We at the archive take the revised data and extract as much metadata from it as we can, so we get an extracted XML file. And what about the things that went on in the script? Well, we actually have to extract them by hand: a person has to read the script and write down what happened. So what are we missing? What we get from these statistical packages is just names, labels for variables, labels for values, and virtually no provenance information. So what we're working on is a way to automate the capture of this variable transformation metadata. Our idea is this: we're going to write software where you take the script that was used to modify the data — the very same script — run it through what we're calling a script parser, pull from it the information about variable transformations, and put that into a standard format, which we're calling a Standard Data Transformation Language. Then you take that information and incorporate it into the original DDI — you update the original DDI — and then you've got a new version of the XML that is in sync with the revised data. This process requires two different software tools: one that will read the script and turn it into a standard format, and a second one that will update the XML, and that's what we're building. We are building tools that will work with the different software packages and update XML, and we're writing these parsers for scripts in four different languages: SPSS, SAS, Stata and R. The reason we're doing four languages is that if you look at the column over there on the right, which is based on downloads at ICPSR in cases where the dataset had all four formats, you can see that there is not a single dominant format: SPSS and Stata are the most downloaded formats from ICPSR, at about 24 per cent each, and SAS and R both have about 12 per cent. If we did one package we'd be pleasing only a few people and we couldn't have an impact, so we're writing parsers for four languages.

Here's something that's come out of our work that you might find interesting, and it's about why we need a special language for expressing these data transformations. Here are three brief programs, in SPSS, Stata and SAS, that are all designed to operate on the same data. I tried very hard to make the scripts identical, and I think I succeeded, but if you run these three programs you get three different results. The key thing is to look at the last row, the row in which we set the minus one to be missing: in SPSS you get two missing values, while in Stata and SAS one of the variables is set to a number, but it's a different one in each. Why does this happen? The reason is that in logical expressions SPSS makes the result of any expression that includes a missing value missing, which in most cases is treated as false; Stata treats a missing value as a number equal to infinity; and SAS treats a missing value as a number equal to minus infinity. So both Stata and SAS actually do return a number when you have one of these comparisons. It's actually more accurate to represent the data in this way, which you wouldn't see if you just looked at the dataset. So what we're doing is creating our own language — actually using a language that's been created by another community, the SDMX community, called the Validation and Transformation Language — so that we can put all three of these languages into a common core.
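To make that divergence concrete, here is a small illustrative sketch in Python — this is not C2Metadata or SDTL code, just a paraphrase of the three conventions described above — showing how the same comparison against a missing value resolves differently in each package.

```python
# Paraphrase of the missing-value rules described in the talk:
#   SPSS : a logical expression containing missing yields missing
#          (usually handled downstream as false)
#   Stata: missing sorts above every number, i.e. behaves like +infinity
#   SAS  : missing sorts below every number, i.e. behaves like -infinity
import math

def greater_than(x, threshold, package):
    """Evaluate `x > threshold` as the named package would when x is
    missing (missing is represented here as None)."""
    if x is not None:
        return x > threshold
    if package == "SPSS":
        return None                       # missing result, treated as false
    if package == "Stata":
        return math.inf > threshold       # missing acts like +infinity
    if package == "SAS":
        return -math.inf > threshold      # missing acts like -infinity
    raise ValueError(f"unknown package: {package}")

for pkg in ("SPSS", "Stata", "SAS"):
    print(pkg, greater_than(None, 2, pkg))
# SPSS  None   (missing)
# Stata True
# SAS   False
```

A common intermediate language has to make these rules explicit so that the same recode can be documented unambiguously, whichever package actually ran it.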
So what are we doing, and why are we doing it? The goal of the project is to capture this metadata and to automate that capture. If we can capture more metadata from the data creation process, we'll be able to provide much better information to researchers about what's in a dataset, and automating the process, we hope, will make it cheaper and easier for everyone. That has been one of our principles here: if we can't make it easier for the researchers, they're not going to do it, so the hope is that the software will make their lives easier. And here, just to acknowledge my partners in this: we've got partners from a couple of software firms, Colectica and Metadata Technology North America, the Norwegian Centre for Research Data, and the two projects I mentioned, the General Social Survey and the American National Election Study, are part of the project too. So that's my talk.

Thank you very much, George. We had a question that came through earlier in your talk, when you were speaking about people putting variables into ICPSR and searching for them. One of our attendees has asked: when a user searches for a variable, do they need to come up with the exact variable name as in the variable index? — Right now what we're doing is really a text search, and when you search for variables you're searching over the variable name and the variable label, and it can also bring up items that are in the values for the variables. But one of the problems in the social sciences is that people don't reuse questions very often — we don't have a tradition of reusing questions — and it's very hard to find the same question in multiple datasets. The kind of search we're doing now in our question bank is, frankly, kind of clunky, and it often misses things; that's an issue I'm trying to address in some other projects, where we're trying to improve the way we can search over variables.

Thank you very much. We've got a question for you as well, Nick: how widely is PROV used, and what have you found to be the main challenges in working with PROV? Noting that a v2 is not on the horizon, is it easy to update a PROV model if a change is required? — Okay, first part first: how widely is it used? I have a direct interest in things provenance, but aside from that I'm interested in things geospatial and, I guess, physical sciences data, and in that community there's only one game in town, and that's really PROV. It's early days, though, so most of the spatial and geophysical places — the hard physical sciences side — are either using their own systems or intending to use PROV; there are not many that are already using PROV, but there are certainly not many intending to use something other than PROV, outside of my own Geoscience Australia area. In other communities I know of, including DDI and so on — and remember PROV has only been around for a few years — if people can characterise their problem in a provenance way, that is, they actually understand this is a provenance question as opposed to some other kind of question, like an ownership or attribution question, they fairly quickly end up at PROV. So I think it's certainly more widely used than any other provenance standard has ever been,
and it's showing signs of being much more widely used than that, because other initiatives in the space have been, in a sense, swallowed up by PROV now. The second part of the question was about the challenges, and I've identified one already, which is that people have to know that they're asking a provenance question. We get a lot of questions which are synonyms for provenance questions — probably much like variable naming — where people say they're interested in the lineage of their data, or the transparency, or the process flow, or the ownership or attribution, and those all are, or could be, provenance questions. The hardest thing is to work out specifically what questions are being asked, and then, if there is an existing metadata model or something in that space already, what it's doing and what it's not doing, and therefore whether we need a specific provenance initiative. For instance, many metadata models have authorship, ownership or creator information in them, so if your provenance question is "I want to know the datasets created by Nick", that kind of provenance question you can usually answer in other metadata systems; you'd need something a bit more complicated than that, and term it provenance, before thinking about using a provenance system. The other thing is the move away from what I call point metadata, where you've got a single thing with a bunch of properties hanging off it — a study or a document or a chunk of data with a bunch of properties. That's one way to do things, but what PROV and models like it are interested in is whole networks of things related to other things. It's more complex, but it's much, much more powerful.

Great, thank you very much. A question for George: how are sensitive data variables or values controlled for during the C2Metadata automatic capture? ICPSR has a confidentialising service on ingest — is this process carried over to the C2Metadata project, and is this activity captured in PROV-like metadata? — The C2Metadata model is to operate solely on the metadata, not on the data, so it doesn't really play into the issue of confidentiality. If you're interested, in two weeks we're going to have another webinar where I'm going to talk about how we manage confidential data, but in general it's rarely the case that we have to mask the metadata of a dataset for confidentiality reasons; obviously, controlling the data is something else.

We've got another question here for George: your script parser that reads a SAS script — would researchers need to install that in their SAS package? — We haven't got to that point yet, but probably not. Probably what we'll do, at least as a starting point, is offer it as a web service, and what you'll do is simply export your SAS program into a text file, upload the text file to the web service, and it will return a new XML file.

We've got another question here: does PROV support the workflow of creation and approval of provenance data, e.g. where a PROV entry is proposed and has been submitted to the data custodian for approval? — That's got two kinds of answers to it: one is a generic PROV answer, and the other seems to be more in line with a particular repository doing a particular set of steps. So this isn't exactly what you asked, but I'm going to answer it in a slightly different way. You can talk about the provenance of provenance, which is a bit tricky, but say you had information about
the lineage or the history of a dataset, and you wanted to control that chunk of stuff. You could treat that thing as a dataset itself, even though it's about something else, and manage it, and you could certainly work out how to link your dataset to the dataset that contains its provenance information. So you can do that. The second part of the question — or, I think, the general sense of the question — is more to do with how a specific repository does things. Does PROV support the workflow of creation and approval? In general, you can represent anything in PROV, because it's really high level and it's got those three generic classes of entity, activity and agent, and there's almost nothing in the world that I've come across that you can't decompose into one of those three things: is it a thing, is it a causative agent, or is it a temporal occurrence? So in general, yes, you can model workflows like that and populate them.

And Natasha asks a philosophical question for the whole panel: how do you think provenance relates to trust? — I'm going to jump in very quickly and say that provenance models before PROV often had the word trust in them somewhere, and many of the motivations for provenance models were to do with trust. We deal with trust in that the goal of Geoscience Australia is to put out data and make it open and transparent, and it's fundamentally a trust issue for users of that data: they want to know how the data came into being. That's really what provenance is about — telling the history of something so that you can generate trust. Then there are the specifics of what you put in there: you can work out, do I trust the people who created this thing, do I trust the process that was undertaken to do the transformations, do I trust the particular chunks of code that were used? So that's the generic answer. Then there are more specific issues, like: for data in this repository, how do I trust that, even though you're telling me something about it, it's in fact true? There are also very difficult questions about how I trust the metadata itself, even if it looks plausible. If the claim is "this data comes from God, delivered to you on a stone tablet", I could write that down, but is it true? You have to work that out, and that is a non-provenance thing; you have to work out some other way of attributing a trust metric to that claim. It might be that it's digitally signed and you trust the agency that delivered it — that's an appeal to authority; you might trust that there is enough information present for you to understand the process well enough to have confidence in it; it might link to well-known sources, like open code, that you trust; or maybe there's a mechanism for you to validate certain chunks of data or calculations — so if the total was five and you can look back through the provenance and see a two plus three somewhere, then when you see five you can do the calculation yourself and establish that trust directly.

I think Nick said it very well, but I'll say the same thing in a few words: provenance really is fundamental to trust, and Nick hit the nail on the head when he talked about transparency. Provenance is about transparency, and in the world we live in now even appeals to authority don't work very well anymore. I think that for science to gain
legitimacy and gain trust, we have to be transparent, and that's what provenance metadata is all about.

So we've reached the end of our time, and I'd just like to thank our three speakers for coming along to our Canberra office today and speaking to us about provenance, and for introducing lots of new acronyms to us all — every time I encounter anything new at ANDS there are always more acronyms to learn. Thank you very much for coming. We have two more webinars in the social science series, so we hope to see you there again.