Okay. It is just a little bit after four o'clock and it's time to get started. Welcome everybody to CNI's first virtual meeting and its opening virtual plenary. It's certainly a very different world than what we imagined when we first laid our plans for San Diego, and there's a lot more we can and certainly will be saying about that in the future. Today I just want to say that I'm delighted you're all here. You're all, hopefully, well and safe and with us, and I hope you'll continue to engage with us over the course of this virtual meeting. I know that all of you who are here have at least a pretty good understanding of how we're doing this virtual meeting, because you wouldn't have gotten this far if you didn't. I do just want to note that we will be doing the closing virtual plenary tomorrow at the same time. We had a very successful executive roundtable earlier today and we'll have another one tomorrow, and the reports on those should be quite interesting. We have had quite a good discussion of recent events and how those have changed strategies for acquiring teaching and learning materials. We will be adding a few additional things that weren't part of the original plan for the San Diego meeting. We'll be reaching out to member reps to participate in another extra executive roundtable. We'll also be calling for a few additional presentation slots on late-breaking events, so watch for those and we'll keep you posted. I don't really want to say much more at this point other than, once again, to say I'm glad you're with us. We will be reaching out to you at the end of this virtual meeting to learn as much as we can about what worked and what didn't work, for future planning, and to help us understand different kinds of events we might be able to do in the future, including perhaps some things that have hybrid in-person and virtual components.
With that, let me just say two or three quick mechanical things and then I'm going to introduce Rob. I suspect many of you are intimately familiar with Zoom at this point, having spent most of your life in it for the last week or more. We are doing this technically in what is Zoom webinar mode, as opposed to Zoom meeting mode, which means everybody has come in muted except for our speakers, and attendees aren't generating video, which will cut down the load on the poor overworked internet a little bit. If you have questions, the best thing to do is use the Q&A tool, and we'll do questions at the end of the meeting. We're not going to try and deal with raised hands as we go; that's just a little bit too complicated. At the end of the meeting, after we complete the formal Q&A and presentation, Rob has agreed to stay around for a few minutes if any of you want to chat informally with him. So if you do, there'll be a brief opportunity for that at the end of the meeting. With that, let me introduce our plenary speaker, Dr. Rob Sanderson. I was trying to remember when I first met Rob. I believe it was in Liverpool, if my memory serves me, quite some years ago when he was working on his amazing PhD dissertation. You may not know it, those of you who know Rob particularly for his technical work, but he is also a really serious digital humanist who has done really serious digital humanities work. He did just an amazing dissertation which foreshadowed a whole lot of things that have happened since in how manuscripts have migrated to the web as these multi-layered documents. After finishing up in the UK, he came over to the US. He spent time at Los Alamos with Herbert Van de Sompel's team there, and then did a stint at Stanford, where he was very instrumental in getting the International Image Interoperability Framework (IIIF) off the ground. He really did a tremendous amount of cross-pollinating across various projects and silos at Stanford.
And then about four years ago now, I believe, he came to the Getty as their first semantic architect. It's worth reflecting just for a minute on what kind of an amazing institution creates a position in the early 21st century called semantic architect, and what that says about how the world of cultural heritage is changing. As you'll hear from Rob, it's changing, but it needs to change a whole lot more, and he's going to give you some, I think, really important insights into that. Before turning it over to Rob — and I'll materialize back at the end to help the questions flow — I just want to express my profound thanks to him for working with us through all of this uncertainty and complexity, and for just being willing to be here for us under very different circumstances to do this plenary. So many, many thanks to you. Virtual applause from everyone, and over to you. Great, thank you Cliff. And thank you everyone for joining in these, well, hopefully uniquely challenging times. Thank you also to all of my colleagues, some of whom Cliff mentioned: Herbert Van de Sompel at Los Alamos, Tom Cramer at Stanford, and Lily Pregill and David Newbury at the Getty, and all of my colleagues in the communities — IIIF, the W3C and so forth. This presentation really just repackages a lot of the work that has gone on over the last 20 years in cultural heritage and research data. And equally, I would like to beg your forgiveness and understanding as the virtual guinea pig for this new world that we find ourselves in. So there's no need, of course, to justify access to research data to this crowd — it's in the name of the coalition, after all. What I would like to talk about instead is some of the challenges that we have in the cultural heritage sector about data, and publishing cultural heritage data across diverse institutions and diverse sub-domains.
With multiple modes of access and multiple quite inconsistent audiences — but one that I will argue we can approach in an incremental fashion. And of course, particularly these days, with financial and social stresses on the system, how we can improve the situation for such a research data ecosystem to be more sustainable and more accessible at the same time. So everyone has some understanding of cultural heritage, of course, and your thoughts when I mention that topic might go to beautiful paintings, such as this Rembrandt housed in the Getty Museum. Or it might go to photographs — here, a photograph taken by Ed Ruscha in his Streets of Los Angeles series. But when we go to look for that photograph online, because it's housed in the Getty Research Institute, the interface that we are presented with is very different from the interface that we would be presented with if that exact same photograph was housed in the Getty Museum. Here we have the representation of the series of which that photograph is part, in the Getty Research Institute's library interface. And you'll note at the top that it's available in the Special Collections, blah blah blah, please come to the reference desk, and a finding aid. If we click through to that finding aid, we get yet another very different interface. So here the creator or the author at the bottom, Beth Angaine, is the person responsible for the finding aid, rather than the artist, who is listed at the top as a collector. So why do we have these vast differences, even within a single institution? I would argue that it is because of the diversity of the subcultures within the cultural heritage domain, and also the scale of the collections that we manage in those various subdomains. So libraries, of course, have many, many non-unique information-carrying objects: books. This means that there is a chance for copy cataloging, and a very rich environment replete with standards. Archives have many unique information-carrying objects.
The defining characteristic of archival content is its massive scale. This means that cataloging is often at the collection level rather than at the single item level. This has also meant that the standards are intended to describe the collections, typically, and not to describe the individual objects, even if they are unique. Museums have, compared to archives and libraries, relatively few objects. They are all relatively unique. But instead of carrying information in text, they typically carry some sort of image, be it a photograph, a print, a drawing, a painting — or you could even imagine a sculpture carrying some image content. The museum domain, conversely, has a very low number of standards, and that's an area we'll discuss further in this presentation. I know that cultural heritage is often described as libraries, archives, and museums, sometimes adding gardens or galleries for GLAM. I would also like us to consider conservation and conservation science as part of cultural heritage. Conservation science is traditionally less concerned about metadata and more concerned about scholarly communication: how can they communicate their findings about the conservation activities that they are performing? Now this is very stereotypical, and there are of course many exceptions that I'm sure everyone is shouting at their screens about. One example would be Yale's Beinecke Rare Book & Manuscript Library, which might be thought of more like a museum, because the objects are more unique than in a typical library. The point is that we need to take into account the cultures and practices of the various sub-domains when we are thinking about access to research data for the cultural heritage industry. So when it comes to industry, we might think of ourselves — or we do think of ourselves, according to this (only n=33) Twitter poll — as part of the knowledge industry, with a smattering of entertainment, warehousing, and technology.
But when it comes to what we do on a daily basis, that knowledge industry share decreases way down below 50%. We spend our limited resources as cultural heritage organizations for very different purposes. So not only are there various sub-domains with different practices and different cultures, there are very diverse objects and collections that we need to deal with. Cultural heritage organizations are not primarily research organizations. We need to balance the publication of knowledge and data with being amenable and attractive to the general populace — the entertainment side. We need to deal with looking after our collections — warehousing. And particularly these days, when no one can visit those collections in person, we also need to be attuned to technology needs. So what, then, is a vision that can include and position research in a practical, achievable manner alongside these other requirements? How about something like: our diverse cultural heritage is digitally available and easily usable — we'll come back to usability a lot in this talk — via sustainable applications, to further public engagement and research. To this end, I think there are three core challenges, and the colors here delineate the sections of the presentation. There is the diverse and inconsistent data that we publish, due to the intentionally diverse — the welcomely diverse — collections that we have, and the less welcomely inconsistent way that we look after them and describe them. There is a plurality of access methods for the multiple audiences that we are trying to simultaneously serve. And, like many industries, there is a lack of strategy around sustainability for those products. We might also consider these as three core requirements: shared abstractions that lead to sustainable implementations that lead to satisfied audiences. Without shared abstractions, we can never speak the same language, and our data will always remain inconsistent.
Without sustainable implementations, we won't be able to keep it available. And unless we satisfy our audiences, we will never be able to prove to the people with the purse strings that we need to continue to do this. So on to the first part, the purple part: how to manage our data. How do we get shared abstractions across a very diverse set of sub-domains? The way that I have been thinking about this — again, as part of discussions and collaborations with many people — is this tripartite structure. First we have a conceptual model, which is the abstract way that we can think about the domain in its entirety. We need to think about the domain using conceptual models holistically — it can't be just per sub-domain; consistently — it needs to be broadly consistent across the entire domain; and coherently — we need to be able to apply logic to these sorts of things, if not necessarily inference. Further, we should have as few conceptual models as possible, because if we change the way that we are thinking about things, or people are thinking about data in different ways, then that data will necessarily not be interoperable, and therefore we will not get to some degree of consistent, accessible, sustainable ecosystem. However, just because we think about things in the same way doesn't mean that we need to write it down in the same way. So we can have multiple ontologies, which are the shared formats that allow us to encode the thinking in a machine-actionable way. I don't necessarily mean machine-interpretable — we'll come to audience later — but machine-actionable: we need machines to be able to process it, not necessarily understand it and draw inferences from it. And finally we need vocabularies, which are the curated sets of sub-domain-specific terms that can make that ontology more concrete for the particular sub-domains.
Today, in our research systems, we tend to collapse all of these things down to one particular document or one particular standard, rather than separating them out and allowing them to be composed in different ways as appropriate to the particular use cases. So there's our first challenge: how can we think about things separately, describe them consistently, and use sub-domain-specific language, or vocabularies, to talk about them? The goal of this abstraction layer has always been towards completeness. We want to be able to express everything that anyone might want to document, because the fear is that if they can't, then they will go off and invent their own — xkcd's fifteenth competing standard. Or, as my colleague Lily Pregill likes to say, there should be no data left behind. And that applies equally to record instances, where the goal for the individual records is correctness. Because errors in our records impact the confidence that users have in the data that we are publishing, which then impacts the reputation of the institution for publishing bad data, and the reliability of any research which has been performed using that bad data. However, here is the issue: the perfect is always the enemy of the good. Left unchecked, this process of abstraction, of mapping the existing records into these formats, and of cleaning the data to be as perfect as possible, will consume all available time and effort, and eventually no data will see the light of day. Meaning that we have wasted all of the time and all of the resources that we have assigned to this work, unless we can overcome this particular challenge. How can we do that? Well, there is a need for application profiles. Not everyone needs to know everything. There are particular sub-domains, and within those sub-domains there are particular foci of interest in particular data sets, that do not need the entirety of the model, the entirety of the ontologies, or all of the vocabulary terms.
So we can specialize this by selecting the appropriate abstractions and documenting that selection as an application profile. One such application profile that we are working on is the Linked Art model, which is a linked open usable data model for cultural heritage, collaboratively designed across organizations to be easy to publish and easy to use in consuming applications. Our design principles focus primarily around usability, and not about precision and completeness. We want consistency, and we want to engender community, by assuming that we can get to 90% of the use cases — not 100% — but with only 10% of the effort being put in. If you'd like to see more, then our new top-level domain .art has been immensely useful, and the URI is linked.art. In terms of community, we have a wonderful set of people and institutions involved. You can see some hopefully familiar museums and collection-owning institutions on the left-hand side there, from around the world. And in the right-hand column there are research institutions such as Yale and Oxford, but also aggregators of information such as Europeana and the Canadian Heritage Information Network. We are formalizing the profile within the International Council of Museums, and are immensely grateful to the Kress Foundation in the States and to the Arts and Humanities Research Council in the UK for funding to be able to work on this project. So, on to the distinctions between access and audience. If the data is the what — the thing that we are interested in — then access is the how, and audience is not who but why. We don't really care who you are; we want to know why you are interested in having access to the data, so that we can best serve your information needs. This partition then gives us a very useful split, between usability being the focus of the access part, and use being the focus of the audience.
However, it also underscores the need for partnership between the publisher of the data and the consumer — between the cultural heritage organization and the researcher. Why? Because the cultural heritage organization cannot publish absolutely everything that they might want to, nor can the researcher know that they will be able to get access to everything. But without some degree of collaboration, the cultural heritage institution won't know what they should be publishing, because they won't know the use that is intended for it. So we need a degree of partnership, particularly around use cases from the audience, and usability from the side of access. Technically, how do we get access to data? Well, access to data is handled via APIs — or "agreements preceding interactions". And yes, semantic architects and/or pedantic architects: I do know that API really stands for application programming interface. So APIs are how programmers interact with data across system boundaries, but they are also social contracts that we make with each other. By publishing an API, we say: this is how we are going to let you interact with our data, and we will maintain it in that way, such that when you write code, we won't go changing that API willy-nilly and breaking your code. So there is some degree of trust across this boundary that we need to be able to establish. Equally, there is a distinction between profile and API: the profile is on the data side, the API is on the audience side. The profile, then — the metadata application profile, if you will — is a selection of appropriate abstractions to encode the scope of what can be described using the data that will be available. The API is a selection of appropriate technologies that give access to the data that's managed using that profile.
Some examples: the scope would include things like the classes that are used, the properties and relationships that connect the instances of those classes, and the vocabulary terms that make the ontology more specific. Access, on the other hand, with the API, is about the document formats that are available, be that XML, JSON or other; the URL patterns that are in use for publishing the data; and the operations that you can perform using the API, be that create, retrieve, update, delete, or things like browse, search and similar. One exemplary organization that I've been very fortunate to be part of over the last decade in terms of publishing APIs is IIIF, and I'm certain that most, if not all, of you will have seen this slide before, so bear with me. IIIF is a community, first and foremost, that develops APIs, implements them in software, and uses that software to expose interoperable content. The four IIIF APIs are image, presentation, authentication and search — but here's the point: they focus entirely on media, and not on data. They focus entirely on presentation of that media, not on interpretation of data for research. It's even in the name. We used to call the presentation API the metadata API back in version 1.0 days, and we renamed it for version 2 because we realized that we are not focusing on providing access to metadata or data. It is: how can we drive a viewer to present media — now, in version 3, including video and audio — to an end user, such that the human understands the objects that have been digitized. The focus, then, has been entirely on the usability of those APIs for software engineers to accomplish their tasks. That has meant that we have various design patterns for our APIs, and these patterns are about access; they're not about the data. So, the important ones — I won't go through them all. Number four: make the easy things easy and the complex things possible. I think this has been one of the most critical ones that we have adopted.
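To make the Image API's design concrete: it composes region, size, rotation, quality and format into a single predictable URL, which is a large part of what makes it so easy for developers to adopt. A minimal sketch in Python; the server base URL and identifier here are hypothetical, and the `max` size keyword follows Image API 3.0 (2.x used `full` for that slot).

```python
# Minimal sketch of composing a IIIF Image API request URL.
# URI template: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

def iiif_image_url(base: str, identifier: str, region: str = "full",
                   size: str = "max", rotation: int = 0,
                   quality: str = "default", fmt: str = "jpg") -> str:
    """Compose a IIIF Image API 3.0-style URL from its path parameters."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Full image at maximum size (hypothetical server and identifier):
url = iiif_image_url("https://iiif.example.org/image", "obj-1234")
# → https://iiif.example.org/image/obj-1234/full/max/0/default.jpg

# A 200px-wide rendering of a cropped region, e.g. for a detail view:
detail = iiif_image_url("https://iiif.example.org/image", "obj-1234",
                        region="100,100,400,400", size="200,")
```

Because every parameter lives in the path, a viewer can derive tiles, thumbnails and crops without any server-specific knowledge beyond the base URL.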
There should be an easy on-ramp for people to very quickly become productive, and then, as they progress through their implementations, they can layer on additional complexity and additional knowledge when they have time to spend additional resources. And also number eight: design for JSON-LD, using linked open data principles. JSON is the lingua franca of web developers these days, and linked open data, or the semantic web, has often been a curse word in those circles. So by focusing on the JSON serialization, we make the information usable, while still sticking to our principles of connecting across institutions and datasets. So why has IIIF been so successful? And importantly, how can we reproduce that success elsewhere? I argue that it is because of this focus on usability and on community. The community as a whole has been very responsive, to itself and to others, to enable the software engineers to be productive. By focusing on usability of the API, this gives us a correction function for the abstraction layer's tendency towards completeness. There is a trade-off that has to occur here because, as the title of the slide says, the API is the user interface of the developer. We need to pay as much attention to the design of the API for usability as we do to web interfaces or any other user interface for mere mortals — people using the data. And we must take note of this for research environments. If we do not reproduce this particular pattern, then we would tend, as previously, towards this correct and complete end goal without any concern for usability. And if it's not usable, it clearly will not be used. If it is not used, how can it be sustainable? Further, it needs to be consistent. One of IIIF's main reasons for success, I believe, is that it has focused also on consistency. There have been tools put into place to be able to validate whether or not people have implemented the APIs in the way that was intended.
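The "design for JSON-LD" pattern he describes can be illustrated with a sketch of a Linked Art-style record. The context URL, class name and property names below follow my reading of the Linked Art documentation and should be checked against linked.art; the object, its URI and its label are invented for the example.

```python
import json

# Hypothetical, minimal object record in the style of the Linked Art profile.
# To a web developer this is plain JSON; the @context maps the friendly keys
# onto the underlying ontology for linked-data consumers.
record = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "id": "https://example.org/object/1",   # invented URI
    "type": "HumanMadeObject",
    "_label": "Example Photograph",
    "identified_by": [
        {"type": "Name", "content": "Example Photograph"}
    ],
}

doc = json.dumps(record, indent=2)
# An ordinary JSON consumer just reads record["_label"];
# a linked-data consumer expands the same document via the @context.
```

The point of the pattern is exactly this dual readership: one serialization serves both the pragmatic JavaScript developer and the semantic-web aggregator.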
There are multiple implementations of clients that can be used to check: did I do this right? Rather than just hoping that people read many pages of documentation and can then convey that to a software engineer who will then be able to implement it correctly. This has meant that the cost of implementations has been greatly reduced. So, not only are there these tools, there is also community: a responsive community that can answer questions from software engineers, or users, or anyone else when they come up, in a very quick amount of time, compared to previous days of sending an email to a mailing list, waiting for a week, moving on, and never returning to that particular question. One of my former colleagues at Stanford, Tom Cramer, to whom I owe a great deal, is interesting in this respect because of two particular projects that have come out of his team. The first, formerly called Hydra, now called Samvera, is essentially selling a product: a particular digital library software implemented in Ruby. IIIF, also coming out of a Stanford project, instead focuses not on the implementation but on the APIs. So it can sell community, with lots of interoperable products, rather than a single product and a single technology set. One of the other important design patterns for IIIF is to not introduce unnecessary technology dependencies. This focus on usable, consistent APIs, giving access to the back-end correct and complete data, gives us this sustainability pattern. So, if we replicate this pattern multiple times across multiple institutions, we then have audiences that we need to serve using it. I believe that there are four primary categories of audience — and remember, these are not who but why. These categories are humans, machines, the network and research. And I believe that these categories build upon each other in an incremental fashion. We can start at the bottom and work our way up.
We can start with data for humans, or strings, and work our way up towards data for research. Data for humans, then, is separated entities — the object separated from a person, separated from a place — with attached textual descriptions that can be displayed to a human, and the human can read the text and understand what's going on. This is essentially the manifest from the IIIF APIs. Then data for machines, or structured data. Here the entities have machine-processable values. These values would be things like dimensions — not as a string, but as a number and a unit — so that you can then compare five inches for one small statue to 15 feet for a very large statue. From there we can go on to, if you'll forgive me, distributors, or data for the network. Here the entities have the same structured data, the same human-oriented data, but are connected across systems and across institutions. Here the data is there to enhance the network effect, so that we can find other data and benefit from other institutions' knowledge. Finally, once we have the network effect up and running, then we can focus on stringent data, or data for research. Here the data needs to have sufficient accuracy, and be present in sufficient quantity, that we can answer research questions after aggregating that data together. So I believe that there are five C's for research data, much like the C's of diamonds. However, unlike color, clarity, cut and so forth, these are: consistent — the data must be consistent across implementations and across institutions. Connected — for that network effect to take place. Collaborative — I know that this is a cliche these days, but we are all in this together; we will not succeed in doing this all by ourselves; we have to work on collaborative products and projects. It must be correct and complete enough — it can't have many, many, many errors.
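The "data for machines" step he describes — a dimension stored as number plus unit rather than as a display string — is what makes comparison computable at all. A toy sketch, with a hypothetical unit-conversion table:

```python
# Hypothetical conversion table: every supported unit, expressed in centimetres.
UNIT_TO_CM = {"inches": 2.54, "feet": 30.48, "cm": 1.0, "m": 100.0}

def to_cm(value: float, unit: str) -> float:
    """Normalize a (value, unit) dimension to centimetres for comparison."""
    return value * UNIT_TO_CM[unit]

# As strings, "5 inches" and "15 feet" are incomparable;
# as structured values they order correctly:
small_statue = to_cm(5, "inches")   # 12.7 cm
large_statue = to_cm(15, "feet")    # 457.2 cm
```

Real systems would carry the unit as a vocabulary term (and often a precision), but the principle is the same: numbers and units, not strings.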
Otherwise the research will not be successful — but we can't let that prevent us from moving forwards. And it must be contextualized. This one I'd like to highlight, because I believe the others are relatively straightforward. Contextualized data is important because the users must understand the environment in which the data exists sufficiently well to have confidence in their use of the data: can they put the data to use for their own ends? I think there are two primary factors that I've heard in this realm that we could consider. The first is data provenance: who created the data, when, why, in what situations, and so forth. There has been a tendency to try to put this into the data itself, but I do not believe that that's necessary. Instead, I think that we can have a description of this, intended to be readable by humans, per dataset. Why? Because the publishing institution's reputation is almost certainly more important than any documentation that they could put out. If the Getty, as a reasonably well-respected organization, puts up some data, and another organization that no one has ever heard of publishes exactly the same data, our data would almost certainly be treated as more reliable than the other data. Now, I haven't seen any studies of this — if anyone has, please let me know; I would be fascinated. The second is uncertainty of assertion. This is really the confidence of the publishing institution in their own data: a user should not feel more confident in the data than the publisher is, perhaps. The issue is that this is going to need to be at the data level, and needs to be present so that it can be taken into account during aggregation. However, this drastically reduces usability if it's to be accessible for that sort of computation. So here we have this completeness problem again. Is there a resolution? I believe there is. And I believe it is not technical, but social.
Although Cliff in his introduction praised my work in digital humanities, I'm not sure that that's warranted. My feeling about digital humanities is instead that we should be thinking of it more as corpus humanities than digital humanities. We don't need perfect, certain data. We just need to ask corpus-appropriate questions of the large amounts of imperfect data that we actually have access to. We should be looking for broad-beam illumination of patterns in the data, rather than laser-focused specific research questions. This approach would minimize the challenges of the uncertain data — and of whether or not we can publish that uncertainty along with the data — and maximize the benefits of the network effect, by being able to gather large amounts of information together, admittedly imperfect and incomplete. Particularly in the humanities this is important, and in fields like art history, which are both historical — adding to the uncertainty — and subjective. If I say this is a genre painting, does that mean anything? Does that mean anything to you? Who knows. The fact that I, as at best a potential digital humanist, have described it that way shouldn't lead you to much confidence, compared to an art historian. And art historians, as anyone will tell you, will also come up with at least as many answers as to the style or genre of many objects. So, on to sustainable access to large amounts of imperfect data. There are three possible options for this, I believe. First, we could have a single centralized platform that we all agree upon — my notes say to pause for incredulous laughter; I'm imagining 158 people giggling slightly at that. Or we could have a distributed ecosystem, in which every institution participates to their own ability — so to one of those four degrees of intended audience. I believe that the first one, the centralized platform model, is ultimately not scalable.
We will have so much information made available by all of the cultural heritage and research institutions that we would outstrip the capacity of any single system to manage all of it. It's also, as Kathleen Fitzpatrick so clearly argued and described in her keynote last spring, that any centralized platform is prone to being commercialized — taking an open system and putting it behind a paywall — to being shut down, because the operator no longer wants to run it or doesn't have the resources to sustain it, or to being in some other way exclusionary, which we do not want. So, on to the other two. My first CNI was in the spring of 2004, when Ray Denenberg, Ginny Walker and myself presented the NISO MetaSearch Initiative, building on top of SRU and SRW, to be able to distribute queries around the web, retrieve the result sets, merge the result sets and display those results to the user. It seemed like a good idea at the time. But at a similar sort of time there were also things like Google and the beginnings of Europeana; OAI-PMH was going strong, which has since led to ResourceSync. So, given that, unlike Brin and Page, I'm now not worth $50 billion: it seemed like a good idea at the time, but really it wasn't. Instead, we should be looking to solutions that have individual institutions publishing their own data separately, and then many organizations — rather than the singleton Google — harvesting that data and building custom, specific search and research platforms on top of it. In this space I've been asked in the past about the difference between two four-letter acronyms: FAIR and LOUD. FAIR comes from FORCE11 — I'm sure that everyone at CNI will have heard this term before: findable, accessible, interoperable, reusable. I feel that these are mostly functions of the system, other than reusable, which is clearly less so. As opposed to LOUD — linked open usable data — which is features of the data. So linked: it's connected; open: it's reusable; usable: it's usable; and data is, well, data.
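The harvest-and-index pattern he contrasts with distributed metasearch can be sketched in a few lines. Here the institutional "feeds" are in-memory lists standing in for real endpoints (OAI-PMH, ResourceSync, activity streams); the invented institution names and records are purely illustrative. The aggregator copies everything into one local index, so search has no query fan-out at request time.

```python
# Toy harvest-and-index: each institution publishes its own records;
# an aggregator copies them into a single local index keyed by identifier.
feeds = {
    "getty": [{"id": "getty/1", "label": "Photograph"}],
    "yale":  [{"id": "yale/7",  "label": "Painting"},
              {"id": "yale/9",  "label": "Photograph"}],
}

def harvest(feeds: dict) -> dict:
    """Pull every record from every feed into one index, noting its source."""
    index = {}
    for institution, records in feeds.items():
        for rec in records:
            index[rec["id"]] = {**rec, "source": institution}
    return index

index = harvest(feeds)
# Searching is now purely local — no waiting on N remote servers per query:
photos = [r for r in index.values() if r["label"] == "Photograph"]
```

This is the structural difference from metasearch: the network cost is paid once, at harvest time, by many competing aggregators, rather than on every user query by one federated broker.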
So, while we're in the practice of creating un-Googleable acronyms, I would like to propose one more. How about research ecosystems being SHARED: sustainable, harvestable, available, reconciled, enhanced and discoverable. Running through these features will be the remaining third of the presentation. If we are really part of the knowledge industry, then we should be treating our knowledge assets as core assets of our work, and for that they need maintenance and appropriate governance structures. The same way we have a facilities team that looks after our buildings, the same way we have conservation and cleaning for our museum objects, we need these exact same things for our data. We need to ensure that it is available, usable, accessible, and so forth. This means we need to change our mindset from data being the outcome of a project to data being an ongoing and sustained product of the institution: an asset we need to invest in, rather than leave to gather digital dust. This of course needs different governance structures. I think there is one project, one initiative, which is exemplary in this regard: the Philadelphia Museum of Art's Art Information Commons. So yes, a project, but wait: it is a planning project, funded by the Mellon Foundation, to determine how to embed data and knowledge work as a critical component within the institution. The project is about demonstrating the possibilities, changing hearts and minds within the institution, such that at the end there isn't just data which is left to rot, but a change of culture within the organization such that the data won't rot, and they can then go on to do things like reconciliation and enhancement. Because without these initial governance structures, there is no chance that the data will be usable.
Or indeed, here is the alternative. Brazil's National Museum spent 200 years not investing in infrastructure, and the result was the destruction of the museum and its collections by fire. They didn't even have working fire hydrants in front of the building. This is the outcome if we do not invest in infrastructure. So, in order to redirect resources into this infrastructure work, in order to treat data as an asset, these changes need to be in the self-interest of the institution. I believe that they are, because as we publish data, we can increase our institutional reputation by ensuring the use of that data. Institutional reputation, or global brand awareness if you will, can then drive financial revenue from philanthropy, ticket and product sales. This is exactly the same reason why the football coach is the highest-paid member of staff at many institutions. It's not that the football coach contributes anything to the research outcomes of the university, and it's not that the income from ticket sales has a significant impact. It's the next step, the first degree of separation: the brand awareness of the university through its football team is what makes it so important. Publishing data this way is currently reasonably costly in the museum space, because we do not yet have these standards. But, as we've seen with IIIF, once we get over the initial startup phase, and if the system is adopted internally, a very important part, then we will be able to benefit in the same way. Of course, we'll still need to prove it, and without the television viewing audience statistics that football teams love to throw around, that means metrics. So, in 2018 the Open Data Institute published a white paper about the sustainability of data platforms. They described three particular ways to track data usage. The first was user tracking, and the issue with user tracking is a lack of openness.
It's also a barrier to entry: if people have to sign up for an account or get an API key, then you have put one more obstacle in the way of a software engineer trying to do something with your data, and they will move on to the next dataset. The second option was to just use GitHub, i.e. put the data into GitHub and use GitHub's ecosystem of forks, pull requests, watches and stars to track metrics around the use of the data. Another perceived enhancement was that, in the GitHub methodology, people can submit pull requests to change the data. The issue with that is the lack of synchronization between the data published to GitHub and the backend system of record that maintains the information. By the time there is a pull request on the exported GitHub repository, the internal system of record will have moved on, and you will not be able to integrate the change back again. The third option was data citations via web searching, which of course has exactly the same challenges as all scholarly communication: what is the impact of the publication? However, I would argue it's much easier in this case, because we do not need to demonstrate the value of a particular researcher, a particular dataset, or a particular paper. All we need to do is demonstrate a positive feedback loop to the institution. Remember, this is about driving the reputation, the global brand awareness, of the institution, not of the dataset. So by simply linking back from the Getty Vocabularies' openly published data to the Getty, rather than to the particular term, we can see: oh, we published this data, and now we're getting more hits on our website, which is where we can advertise our exhibitions. This means that citability is a better metric for sustainability than pure openness, which is perhaps controversial. However, I think Dan Cohen expressed this best back in 2013, when he was at the DPLA.
He proposed CC0+BY, meaning the data should be CC0, so legally open and unrestricted in its licensing, but the attribution moves from the legal into the social space via citation. So: legally open, but ethically sourced. Okay, I'm running close to time, so moving on. H is for harvestable. Here we have the method of constructing the ecosystem from the individual publications, and the harvesting method needs, as an API, to be usable. This diagram, and I won't go through the details, is a layout of the IIIF Change Discovery API, which is in the works. The important thing to note is the metadata box I've highlighted there. This is real metadata, not just presentation data. So here, by using the presentation ecosystem, we can start to bootstrap our research ecosystem on top of it, because the search indexes in the middle are indexes over the metadata, not over the presentation data. The API points back to the manifest, so you come back to see the object in a nice environment, but the searches take place over the structured, comparable, network-enriched data. A, for available, is pretty obvious, and a useful vowel for the acronym. I'll touch briefly on the second two points there. We need caching and replication infrastructure to be implemented, for performance but also for preservation: we need to ensure that the data remains available even if the Getty's network link goes down. And my social media influencer friends, of whom I have none, have informed me that multi-channel publication is the buzzword du jour. Which essentially means, if I understand correctly, that the same knowledge is available and used from the same system but via very different interactions: the same source, available via different methodologies. R is for reconciled, or reconciliation. Reconciliation, I would say, is the grand challenge of our time.
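The harvesting flow described above can be sketched as follows. A Change Discovery stream is an ActivityStreams OrderedCollection whose pages list Create, Update and Delete activities pointing at manifests; a harvester walks the pages and decides what to (re)fetch and what to drop. This is a minimal offline sketch: the page below is an invented in-memory example with placeholder URIs, not a response from any live endpoint.

```python
# A minimal, hypothetical Change Discovery activity page, shaped like an
# OrderedCollectionPage from the IIIF Change Discovery API. The URIs are
# placeholders, not real endpoints.
SAMPLE_PAGE = {
    "type": "OrderedCollectionPage",
    "orderedItems": [
        {"type": "Update",
         "object": {"id": "https://example.org/iiif/manifest/1", "type": "Manifest"}},
        {"type": "Delete",
         "object": {"id": "https://example.org/iiif/manifest/2", "type": "Manifest"}},
        {"type": "Create",
         "object": {"id": "https://example.org/iiif/manifest/3", "type": "Manifest"}},
    ],
}

def manifests_to_fetch(page):
    """Return (to_fetch, to_remove): manifest URIs a harvester should
    (re)index versus drop, based on the activity types in one page."""
    to_fetch, to_remove = [], []
    for activity in page.get("orderedItems", []):
        uri = activity["object"]["id"]
        if activity["type"] in ("Create", "Update"):
            to_fetch.append(uri)
        elif activity["type"] == "Delete":
            to_remove.append(uri)
    return to_fetch, to_remove

fetch, remove = manifests_to_fetch(SAMPLE_PAGE)
print(fetch)   # manifests 1 and 3
print(remove)  # manifest 2
```

In a real harvester the fetch list would drive requests back to each manifest, which is exactly the bootstrap Rob describes: the activity stream carries change notifications, while the manifest remains the source of the structured data that gets indexed.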
This diagram, which I put together for a Mellon Foundation symposium on reconciliation, tries to show that it is very expensive to do correctly. At the moment Yale, under their Vice Provost for Collections and Scholarly Communication, Susan Gibbons, is I would say an exemplary institution in this regard, in that they're putting together a three-year program to look at reconciliation across their cultural heritage collections and divisions: the libraries, the archives, multiple art museums, and the Peabody Museum of Natural History. Here at the Getty, we have adopted the OpenRefine platform for two different cases. The first is that our catalogers and metadata analysts use it for reconciling our data. And second, we use the OpenRefine reconciliation API to publish the Getty Vocabularies, so that other people can do the same thing: use the OpenRefine methodology and platform to reconcile their data against the Getty. I think this is where a centralized platform such as Wikidata comes in handy. Wikidata is essentially a centralized hub for massively distributed QA of crowdsourced reconciliation. It's not good for publishing research-quality data, because confidence in that data cannot be very high when anyone can edit it. And anyone will edit it, if Wikipedia is anything to go by. So instead we can use the trusted sources, the actual institutions: we can use their data for the reconciliation, and use the hub to find it. So this is also about discovery. E is for enhanced. I think there are three areas of enhancement that we need to think about. The first is progressive enhancement of our data internally: if we can get to data usable by humans, or strings, and then work our way towards research data, that's better than simply waiting for the data to be perfect and never attaining it. But we also need to accept external feedback and external corrections to our data. We need to allow other people to help us improve our data.
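The reconciliation workflow above rests on the Reconciliation Service API that OpenRefine popularized: a client posts a batch of queries and gets back scored candidates per query. The sketch below shows the shape of that exchange offline; the payload format follows the standard API, but the service endpoint is omitted, and the sample response, IDs and scores are invented stand-ins, not real ULAN results.

```python
import json

def build_recon_queries(names, type_id="Person"):
    """Build the 'queries' payload for a Reconciliation Service API call.
    A client would POST this as queries=<json> to a reconciliation
    endpoint, such as the one the Getty Vocabularies publish."""
    return {f"q{i}": {"query": name, "type": type_id, "limit": 3}
            for i, name in enumerate(names)}

def best_matches(response, min_score=80.0):
    """Pick the top candidate for each query if it clears a score threshold;
    low-confidence queries are left for a human cataloger to review."""
    out = {}
    for key, result in response.items():
        ranked = sorted(result["result"], key=lambda c: c["score"], reverse=True)
        if ranked and ranked[0]["score"] >= min_score:
            out[key] = ranked[0]["id"]
    return out

# A hypothetical service response in the standard Reconciliation API shape;
# the identifiers and scores are illustrative placeholders.
sample_response = {
    "q0": {"result": [
        {"id": "ulan:500011051", "name": "Rembrandt van Rijn", "score": 97.5, "match": True},
        {"id": "ulan:500029066", "name": "Rembrandt Peale", "score": 55.0, "match": False},
    ]}
}

print(json.dumps(build_recon_queries(["Rembrandt"]), indent=2))
print(best_matches(sample_response))
```

The score threshold is the crucial design choice: it separates the cheap automated matches from the expensive expert judgment that makes reconciliation "the grand challenge."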
And that needs a reasonably fast turnaround, because if you submit something and don't see any change, there is no positive feedback loop telling the end user to do it again, and hence they won't. Our crowdsourcing platforms, such as Zooniverse, play on this, particularly using gamification techniques. But here we need another piece of cultural change: a greater tolerance for both the presence of errors, because we know these errors exist already, and the introduction of new ones. We need to ensure that errors can be corrected when they are found, but they are already there; we know they're there, and we are okay publishing things today. We just need to double down on that. Once we have, then we can see, and I believe we will see, a dramatic increase in the use of machine learning techniques, such as the ones being investigated in the PHAROS consortium of photo archives, where there are so many objects, as in any archive, that they cannot be catalogued individually and need to be done by machine. Finally, D for discoverable. If we can't find it, then we have missed the boat. So harvestability is important, and the connections across institutions are important. The data needs to be part of the web, not just on the web: hence the connectedness, but also the use of JSON-LD, which plays nicely with web SEO tools. Nowadays, JSON-LD is in more than 25% of all websites around the world; this is now a ubiquitous, standard technology that we should all be adopting. And finally, we can look at this through a lens of partnerships with industry, because the clients and the products also need support and need to be sustainable, just like the data that we maintain. I would argue that universities and cultural heritage organizations in particular are not well positioned to support many end users in the way that the technology industry is. So, in summary: we need a shared ecosystem of FAIRly LOUD data.
I believe we can get there via a consistent set of data and implementations, using conceptual models, ontologies, and application profiles. These need to balance completeness and correctness against usability. And this can be done via a publishing and harvesting ecosystem, bootstrapped on the existing media and presentation ecosystems, that can be used and supported internally as well as externally, in partnership across organizations and across industries. So, thank you very much for your time. If you would like to stick around and ask questions, I would be greatly appreciative. If you'd like to see these slides, the link there points to them, and they'll be linked from the presentation page on the CNI site. And if you are one of those people who say, don't pass me the mic, I can just shout: yeah, no, not this time. I'm afraid you should definitely use the mic, because I will not hear you otherwise. So, thank you very much, everyone. [Applause] Rob, thank you for a wonderful talk. I have to say, the amount of design wisdom you have packed in there is really something; people will be digesting this for quite some time. There are many important lessons and points in there, and I'm grateful for your ability to look across so many projects and tease out where they've gone right and where they've gone wrong. I'd like to invite anybody who has questions to hit the Q&A button at the bottom of their screen, and we'll see what comes in. No open questions yet, so I'll ask you a quick one while people are cogitating. I was really fascinated to see the question of reconciliation of data appear, and the way you called it out as one of the really key and largely unresolved, indeed even unexplored, problems. I have been struck by how many times it comes up now, and by the tensions that show up between reconciliation and various kinds of data flow across institutions and across contexts.
The problem being, you reconcile it in one place, but by that time something else has happened to it in another place and you can't get back to it, and then you have to figure out how to reconcile it again. Do you have any thoughts on how to really do this at, you know, big-ecology scale? Yeah, so I think the notion of reconciliation hubs is important. As I said, there's a tendency to think of Wikidata as the source of all knowledge, or the potential maintainer of all knowledge, in the same way that Wikipedia is the source of all knowledge. But really, I think there are two aspects to it. One is discovery of the object, and the other is the URI, or the identity, of the object. For the first: do I know the URI, or can I find it somewhere? If we can't find the URI to reconcile against, then we can't go any further. So having hubs that focus on collecting and managing pre-reconciled identities and identifiers is important. And then secondly, one thing I didn't mention is that the vocabulary data is itself data; it's research data. Particularly for things like ULAN, here you can see the Union List of Artist Names, there's a lot of research that goes into establishing the identity of a person, or the identity of a place in TGN, and similarly in every other vocabulary. So if we apply the same principles to the vocabulary data as to the instance data, and use the same ecosystem for managing it, sustainable, harvestable, available, reconciled against other vocabularies, enhanced across the domain, and easily discoverable, then we can ensure there is good access to the data, which will improve the ability for people to more cheaply reconcile their own data. So: first, sharing the reconciliations broadly so that not everyone has to do them, and second, publishing the data that enables the reconciliation to take place in the first place. I think those are the two critical aspects.
I think this is going to be a snowball effect: right now there are some reconciled systems and some reconciled data, but as there are more and more, the hubs will start to appear, and we will see it becoming easier and easier to reconcile systems against those core vocabularies. Thank you for the question. That's a really fascinating question we've got in here. The issue raised is FAIR data, and the notion that FAIR data doesn't really capture ethical dimensions very much. When you think about memory institutions, archives, museums, cultural memory institutions, they of course have very strong ethical considerations around their data, and I think are generally very mindful of those considerations. In the kind of world you're envisioning, you actually see a lot of corporate partners coming into play as well. How do you see sorting out the various ethical and motivational incompatibilities there? Yeah, I think I got that question right; someone will tell me if I didn't. So I think the Smithsonian said this well recently, with their publication of millions of images and datasets: they have tried to publish openly everything that should be open. That's not everything; not everything should be open. One of the collaborations I've been part of in the past, the Indigenous Digital Archive, was a project in New Mexico with 20 of the tribes there to take the digitized archives held at NARA and publish them, such that the history of the tribes, and some of the great injustices that they suffered, would be part of the understanding of the community. However, and here the predatory businesses can really come in, some of those records really cannot be open, because they include things like medical tests of students in the Indian schools.
Some of those could be traced back to genetic issues, which would then impact present-day people who are alive. An insurance company scraping all of this data and mining through it might say: oh, we know who that person is, and we know that this person is a customer, and we are going to deny them coverage, not because of an existing condition of that person, but because of a long-past condition of one of their ancestors. Similarly, there are many different cultural traditions about the openness of images and the openness of content that we desperately need to be more aware of when it comes to this side of things. So when I talk about industry as a partner, I'm thinking more along the lines of the IIIF community, and also the Samvera community and similar, where there are for-profit partners that are part of the community, not merely interlopers. By having APIs and consistent data structures that can be published using open source or commercial software, and consumed using open source or commercial software, we can ensure that any predatory organizations would not have the vendor lock-in that other technology companies in the current era have. So it's not that I want a Facebook, quite the opposite. It's that I want there to be partners, such as Digirati, or such as DCE in the Samvera community, that behave in a respectful way and can use the openly agreed-upon standards to derive a profit from providing a service. This is sort of the W3C model, where you get Google and IBM and Mozilla and Microsoft and you name it, all coming to the table and agreeing amongst themselves, with us as a member as well, of course, about standards, and then going off to implement them separately and advertise their own products. But at the end of the day, without that standard, there isn't a market that can be moved forward.
So I do believe that we need cultural heritage organizations, research organizations, and industry partners to ensure the sustainability of all parts of the ecosystem. That's a very important distinction about the nature of the partnerships. We have another question in, about the extent to which rightsstatements.org URIs can help with this ecosystem. There's quite a bit of mismatch, or let's not say mismatch, but inconsistency, in the willingness of institutions to provide broad access to collections: some are really worried about copyright issues, while others feel they're on much safer ground. Do you think these more flexible approaches to rights statements, compared with the very simple partitioning of the world that Creative Commons licenses do, are going to be important here? I think the rightsstatements.org folks did very good work, and it was, if I recall correctly, Europeana and DPLA who were two of the primary participants. They looked at some 80,000 different rights statements, tried to find the commonalities in them, and arrived at on the order of 15 or 16 such statements, including things like No Known Copyright. With that one, the organization is asserting that they looked for copyright and couldn't find any. It's not that they're saying this is public domain; it's that they're saying, here is an object for which we do not believe there is copyright, but we could be wrong. Or similarly, Copyright Not Evaluated. I forget exactly the terminology used for that one, but it's a we're-putting-this-out-there, if-you-find-there-is-copyright-tell-us-and-we'll-take-it-down kind of approach. I think there are two very good things about this. One is that it's a small vocabulary: there is a small set of terms, each of which has an identity and an identifier, the URI, that can be used in data and will always mean the same thing.
So this is the comparable way; this is data for machines, not the 80,000 different rights statements, which are data for humans. Now we can start to build interfaces and automated processing systems that rely on the presence of that identifier and can behave differently in each of the different cases. So yes, I fundamentally agree that by annotating our data with rights statements that are comparable in this way, we will move the system overall forward, dramatically I hope, which also speaks to Melissa's question and point. And that will mean there is more confidence on the publisher side that what they're doing is part of this broader ecosystem, and that they're not the tall poppy with the lawnmower coming to knock them down. Thank you for that. Thank you for that question. Let's take one more question if there is one. I think you have left people pretty speechless, at least. Thank you.
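The machine-actionable quality of the rightsstatements.org URIs can be shown with a tiny dispatch sketch: because the vocabulary is a small fixed set of identifiers, an interface can branch on the URI rather than parse thousands of free-text statements. The URIs below are real rightsstatements.org identifiers, but the download policies attached to them are invented for illustration, not anything the vocabulary prescribes.

```python
# Rights URIs from the rightsstatements.org vocabulary (these identifiers
# are real; the policy mapping below is illustrative, not normative).
NO_COPYRIGHT = "http://rightsstatements.org/vocab/NKC/1.0/"    # No Known Copyright
NOT_EVALUATED = "http://rightsstatements.org/vocab/CNE/1.0/"   # Copyright Not Evaluated
IN_COPYRIGHT = "http://rightsstatements.org/vocab/InC/1.0/"    # In Copyright

def download_policy(rights_uri):
    """Decide how a hypothetical UI might treat an image, keyed on its
    rights URI rather than on free-text rights prose."""
    if rights_uri == NO_COPYRIGHT:
        return "offer-download"
    if (rights_uri == NOT_EVALUATED):
        # Matches the take-down approach: publish, but flag the uncertainty.
        return "offer-download-with-takedown-notice"
    if rights_uri == IN_COPYRIGHT:
        return "thumbnail-only"
    return "thumbnail-only"  # unknown statements get the cautious default

print(download_policy(NO_COPYRIGHT))  # offer-download
```

The identifier is doing the work here: two institutions with completely different rights prose can still be processed by the same code path, which is what makes the statements comparable data for machines.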