So I'm Joshua Gomez. I work at the Getty Research Institute as a software engineer, and this is Emily Pugh, our Digital Humanities Specialist. We're undertaking a big task at the Research Institute: rebuilding a 30-year-old software system known as the Provenance Index. Do you want to take it away?

Sure. So I'm Emily, and as Josh mentioned, I'm the Digital Humanities Specialist. My background is in art history, and I thought I'd start by telling you a little about the history of the database and of the project: what the data is, how people have used it, and how they're using it now. That should give you some context for what Josh is going to tell you about the project itself, which, I want to emphasize, is just at its early stages.

What we call the Getty Provenance Index is actually a set of multiple databases (depending on how you count, 6, 8, or 12) which together contain more than 1.5 million records. I'm showing you an example of one record, lot 0352 from the British Sales database, as it appears in the web interface, which is how you would see it if you searched the database on our website. The databases cover primarily Western European art, especially painting, and they span the 16th to the early 20th century.

To say a bit about what provenance research is: researchers talk about it in terms of tracing a work of art from the wall back to the easel, that is, from its current location back to the artist's studio. They undertake this kind of research to authenticate a work of art, to research its authorship, or to study artistic influence. For example, if you knew a particular painting was in the collection of an artist's teacher, you might conclude that the artist was influenced by that work. It also helps curators, who might discover, for example, that a particular painting in their collection once belonged to a triptych. And there are legal issues as well, such as proving that an artwork was acquired legally.

The Getty Provenance Index is overseen by the GRI's Project for the Study of Collecting and Provenance, also known as the PSCP, so you will hear me refer to the PSCP. This is the staff that manages and produces the Getty Provenance Index. As Josh mentioned, the GPI was established in the 1980s, initially as a book series to support provenance research; I'm showing you Volume 3 of the British Sales index and an image of one of its pages. The databases were built to fuel the production of the print publications, and they were built in Cuadra STAR, a database system created to serve the needs of libraries, archives, and museums in particular. The staff began by transcribing inventories from private collections in the Netherlands, Italy, Spain, and France from 1550 to 1840, as well as auction catalogs, which of course document the sale of artworks. The books, and thus the databases, were designed to support traditional wall-to-easel provenance research: looking up a single work of art, primarily to find out who owned it at a particular time. You can see why, in the paper version, you would look for your work of art in this way.
However, with the rise of the World Wide Web, people started to realize that the electronic databases themselves could be a research tool rather than just a print-production tool, and a web interface was built to allow users to search the databases directly, as in the example I showed you before. It was still focused, though, on supporting the traditional use case of looking up a single work of art.

Over the last ten to twenty years, technology has of course continued to change, but the field of provenance research and the needs of researchers have changed as well. For example, there is an increased need for provenance information and research tools because museums are now required to do more research on their own collections, changes precipitated in particular by legal issues related to Holocaust-era assets. There is also an interest not just in provenance but in the history of art markets and collecting. These researchers are interested not only in ownership but in things like the price of art and how it changes over time, the mobility of objects (where they have traveled and how they got there), and the influence of dealers and patrons, not just artists or owners, on the history of art.

The other thing that has happened is the emergence of the tools and techniques associated with digital humanities, which allow art historians to analyze and visualize large data sets in new ways. Researchers have started to find that they can use this massive set of data about the art market and art ownership to examine things like the evolution of the art market over time and the social networks of dealers: who was selling to whom, which dealers primarily sold versus primarily bought, and so on. Previously, research on dealers was done on a case-study basis, focusing on one famous dealer; using the GPI data, people can now look at all the dealers in the network instead of only the most famous cases.

Another change, at the GRI and at other institutions as well, is a growing number of special collections items related to provenance, collecting, and the art market. You're seeing an image here from the Felbermeyer collection, photos taken at the Allied Central Collecting Point in Munich that document the process of repatriating artworks looted by the Nazis. We also have papers of art collectors and of lawyers who worked on provenance cases, these kinds of materials. And we are collecting a lot of dealer archives, which contain stock books, but also letters, telegrams, and other correspondence, photos of their holdings, and exhibition histories: what they were showing in their sales galleries at particular times. Inventories and auction catalogs are still useful, but the field has for the most part mined these; there are not a lot of new inventories coming to light, so dealer archives are where the most new data is entering the field.

The GPI has changed with the field, focusing more on contributing records from dealer archives, such as sales catalogs and stock books, many of which have come from our special collections acquisitions. You're looking at an image of a stock book, one of 15 from Goupil & Cie, a French dealer; the scanning of these 15 stock books resulted in 43,700 records being added to the Provenance Index. The PSCP has also filled gaps in the GPI data through research projects that themselves reflect changes within the field.
For example, relatively a lot was known about the French art market but not so much about the London art market, hence the British Sales project, 1788-1800, a collaboration with the National Gallery in London that charted the rise of the London art market. There are also the two German Sales projects, first 1930-45 and currently 1920-1930, collaborations with German institutions that collect information from auction catalogs of those dates, which can be used to trace the provenance of art looted by the Nazis.

So while the staff has sought to update and adapt the GPI to these new needs, there are still challenges with the current system in terms of newer use cases and newer research methods. The databases were built in a pre-internet era and are fairly time- and labor-intensive to maintain. I showed you lot 0352 in the web interface earlier; this is what it looks like in Cuadra STAR. As I mentioned, the database, and initially the books, were built to support print production and to answer the traditional kind of provenance research question about a single work of art, so there will always be a limitation to its use if you want to use it for something different.

There is also the issue of a lack of data standardization. The PSCP staff entered information into the GPI in such a way as to preserve the structure and content of the source documents. For example, you see on the right that a death date is registered as AFT.1683, because that's how it appeared in the source document. That's great news for researchers who want as close a connection as possible to the source document, but it's not so great for computers, which want to see a single number and not things like AFT (a rough sketch of what normalizing such a value might involve follows below). Also, because the GPI is made up of eight-plus databases, each is modeled differently: a field that one database calls "date of document" another might call "lot sale date." Another issue is that source-document languages were maintained, so if it was a Dutch sale, all the information in the GPI is in Dutch. That's a challenge. Some authorities are used (you can see the artist authority here), but in general there is a need for much more data standardization.

The other issue is that the current GPI is difficult to access, understand, and use. This is partly because of that lack of standardization, which makes searching difficult: if you don't know that your artwork was in a Dutch sale, you might not know to search for it in Dutch. It's also difficult to know what's in the GPI, because it has been added to over the years as source materials became available, so the information in it is unevenly distributed. For example, the sales databases cover transactions in the Netherlands from 1801 to 1820, but Belgian transactions from 1668 to 1840. If you don't know the idiosyncrasies of the data and what it covers, you might not know what it's possible to search for, or how.
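To make the standardization problem concrete, here is a minimal, hypothetical sketch of the kind of normalization a value like AFT.1683 calls for: parsing the qualifier into a machine-readable date range while keeping the verbatim source string. The helper and its conventions (for instance, padding "circa" dates by five years) are illustrative assumptions, not the PSCP's actual rules.

```python
import re

def normalize_qualified_date(raw):
    """Parse a transcribed date qualifier like 'AFT.1683' into a
    machine-readable range. Hypothetical helper, not the PSCP pipeline;
    the verbatim source string is preserved alongside the result."""
    m = re.fullmatch(r"(AFT|BEF|CA)?\.?\s*(\d{4})", raw.strip().upper())
    if not m:
        return {"source": raw, "earliest": None, "latest": None}
    qualifier, year = m.group(1), int(m.group(2))
    if qualifier == "AFT":   # "after": open-ended upper bound
        return {"source": raw, "earliest": year, "latest": None}
    if qualifier == "BEF":   # "before": open-ended lower bound
        return {"source": raw, "earliest": None, "latest": year}
    if qualifier == "CA":    # "circa": pad by a few years (arbitrary choice)
        return {"source": raw, "earliest": year - 5, "latest": year + 5}
    return {"source": raw, "earliest": year, "latest": year}

print(normalize_qualified_date("AFT.1683"))
# {'source': 'AFT.1683', 'earliest': 1683, 'latest': None}
```

Whatever convention is chosen, the key point is that the original transcription stays attached to the computed range, so nothing of the source document is lost.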
Another challenge for researchers working on collecting and art markets, at the GRI at least, is that the relationship between what's in our special collections and what's in the GPI is sometimes ambiguous. For example, you might find a record from a Knoedler stock book in the GPI and not know about, or not know how to access, the rest of the Knoedler dealer collection, which could have correspondence related to the same work of art, or pictures of it from an exhibition history, et cetera. And, in the part that's cut off at the bottom there, there is untapped research potential: in some ways this data is locked in a system designed for looking up single works of art, and because of everything I've just mentioned, it's hard to know what you might even be able to find out using it.

Despite such challenges, we feel there is immense potential in this resource. For one thing, because of its size and scale, it is one of the largest art-historical data sets in existence. And despite all those challenges, which you might think mean it doesn't get used very much, it is actually used quite a lot: it was searched over 400,000 times last year, and its use has tripled in the last three years. So it is still a well-used and valued resource in the field. Our in-house expertise in metadata standards, through the Getty Vocabulary Program, is another reason we feel well positioned to undertake this project. And it aligns very well with the growing commitment at the Getty to fostering innovative art-historical research and to open data principles. In fact, this image used to carry a copyright notice online, because we were not as open about letting people use and share this data; that has really changed, and we are now very much encouraged to let this data go.

Hence the Getty Provenance Index Remodeling project, or PIR project. These are its four principal goals. Increasing accessibility and usability are at the top: we want to make the interface easier to use, but we also want to make the data inside easier to get at and to use. As for the bottom two, we want to make sure this new technological infrastructure and set of tools are tied to, and support, the changing field of art market research. For example, we want to strengthen the connections between the Provenance Index and what's in our special collections and library at the GRI. And along the way we want to link, as much as we can, the research or art-historical side of the project with the technological side. I'll tell you more about that aspect in a bit, but first Josh is going to talk about the first three components.

Yeah, so these are the four big pieces we have to tackle. Actually, this slide misses the part that comes before: gathering the requirements and the use cases, what exactly we are trying to do with this. Then we can move on to building a data model to serve that. The user interface and the data modeling go together; each one shapes how you do the other, so they're going on at the same time. The software development we probably won't start until later this year, probably in the fall.

So, to get the data modeling underway, there are three main steps. First, we have to get all the records out of these databases.
I should reiterate that these are about eight heterogeneous databases, and they're not even relational databases; they're so old that they're flat-file databases, which is pretty uncommon technology these days. Once we get the data out, we have to clean it up and try to map it to standard vocabularies and ontologies. In some linked data projects, people roll their own vocabularies or ontologies; we're going to try to stick to standards, to make the data more useful. All of this requires a lot of expertise, some of which we have and some of which we don't, so we're doing our best to acquire the tools and expertise as needed.

I figure every linked data talk is someone's first, so I'll give a quick overview of what it means to convert to linked data. With RDF, you convert all of your data into three-part statements of subject, predicate, and object. In this example, which is a very simple three-column table, we create two statements from one row: Irises was created by Van Gogh, and Irises was created in 1889. You can imagine that with a bigger database with many columns, you end up with dozens to hundreds or even thousands of these statements per row, and so you end up with billions of these semantic triples.

And that's only the first part. The next step is to convert those names into unique identifiers, so that when the data is published, machines will know that the person you are referring to is the same person another organization is referring to. Hence you get linked data: the machine is able to understand that these things are linked. The common phrase is "things, not strings," and it's useful for disambiguation. Even with someone like Van Gogh, there are a dozen different ways his name appears in the database, so we want to clean it up, standardize it, and convert it to a unique identifier. This works well not just for variations in spelling but also for variations across languages: if something is referred to differently in a different language, you can use the same identifier, and every machine will know you're talking about the same person.

Here's an example of a vocabulary we're planning to use, one of our own: the Union List of Artist Names. You can see here a listing (it's cut off) of all the triples in which Van Gogh is the subject, and in the little tabs you can click to find the ones where he's an object. I don't know how many there are where he's a predicate; I'm actually kind of curious to click on that tab now. Some of the other vocabularies the Getty has put out: the Thesaurus of Geographic Names gives you places, and ULAN gives you people. CONA is not ready for prime time yet, but the goal is for it to give you identifiers for things, the actual works of art (I can't give you a timeline on when that will be out; that's not my responsibility, yet). And the AAT gives you, roughly, adjectives: themes and topics. So these vocabularies are all great for the subjects and objects of your three-part semantic triples.
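Here's a minimal sketch of that conversion, using Python's rdflib to turn the one-row Irises table into two triples, with Van Gogh's name replaced by his Getty ULAN URI. The local artwork URI and the Dublin Core predicates are illustrative choices, not the project's actual model.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, XSD

g = Graph()
irises = URIRef("http://example.org/artwork/irises")        # hypothetical local URI
van_gogh = URIRef("http://vocab.getty.edu/ulan/500115588")  # ULAN record for Vincent van Gogh

# One table row becomes two subject-predicate-object statements:
g.add((irises, DCTERMS.creator, van_gogh))                              # created by Van Gogh
g.add((irises, DCTERMS.created, Literal("1889", datatype=XSD.gYear)))  # created in 1889

print(g.serialize(format="turtle"))
```

Because the object of the first statement is an identifier rather than the string "Van Gogh," any machine that encounters it can link the record to every other data set using the same URI: things, not strings. Note that the predicates here came from Dublin Core, which brings us to the next question.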
So what about the predicates? That's where ontologies come in. An ontology is a model of the universe, or at least of one piece of it that you want to talk about, and it describes the relationships between things, so you end up using it for your predicates. A very simple example is Dublin Core: you can say that somebody created something, or that it was created at a certain time. That's a very simple ontology, and we won't be using it. We'll be using a much bigger one called the CIDOC CRM, which stands for Conceptual Reference Model. This thing is a beast; it's gigantic. The spec book is about 200 pages long, just to read through all the different elements you can use.

Right now we are trying to take just one of the eight databases and figure out how to map it into the CRM, and we're trying out a couple of different tools. One common tool for data cleaning is OpenRefine, which I definitely recommend; it's a wonderful tool. When it comes to mapping, it's harder to give a recommendation. I particularly like Karma, an open source technology. We tried it out, and the person who runs our Provenance Index found some limitations: it was making assumptions for her that she didn't like. So we started trying another tool called 3M, built by a computer science institute in Crete, and she's having some luck with it, but she's finding limitations there too. So we can't give you a good recommendation yet, but I can give you a couple of screenshots.

This is 3M. I'm not as familiar with this one, but you can see that you ingest a source document and then specify which element in the CRM each column in your database maps to, and when you're done it spits out an RDF file of the entire database for you. Karma does the same thing. I like it because it draws a nice graph visualization on top of the data, giving you the bigger picture of what you're working with, and you can also do some scripting inside it, so you can split the values in a particular column apart, as we're doing here, or recombine different parts. We might give Karma a second shot in the coming weeks.

Okay, so this is our work in progress, or rather Ruth's, since she is the one who has been working on it. This is one tiny piece of one database's mapping, and it already has a couple dozen elements; once we map the entire thing, it will probably have a couple hundred elements per row. We've also been in talks with people at FORTH, the institute that created 3M, some of whom are among the people writing the spec for the CIDOC CRM, and I believe they have come up with a few new elements based on our use case, having to do with prices and currency, which are difficult to map with the current element set.
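To give a taste of what the CRM's event-centered modeling looks like, here is a hedged sketch of the Irises example again, this time in CRM terms via rdflib. The classes and properties shown (E12_Production, P108i_was_produced_by, P14_carried_out_by) are real CRM elements, but the URIs and the overall shape are illustrative, not our finished mapping.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/")  # hypothetical local namespace

g = Graph()
painting = EX["artwork/irises"]
production = EX["event/irises-production"]
artist = URIRef("http://vocab.getty.edu/ulan/500115588")

# The CRM expresses "created by" through an event that links object,
# person, and time, rather than through a direct creator property.
g.add((painting, RDF.type, CRM["E22_Man-Made_Object"]))
g.add((production, RDF.type, CRM["E12_Production"]))
g.add((painting, CRM["P108i_was_produced_by"], production))
g.add((production, CRM["P14_carried_out_by"], artist))
g.add((production, CRM["P4_has_time-span"], EX["timespan/1889"]))
```

That indirection through events is what makes the model so large, and also what makes it expressive: a sale, for example, becomes an acquisition event that can carry a price, a currency, a date, a buyer, and a seller, which is exactly where the price-and-currency gap mentioned above showed up.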
So those are the tools, but we're still struggling with some of this, and it's good to bring in expertise if you can. We're hiring two people. We've identified a candidate to help us with the data extraction and cleaning, and our big recent win was hiring Rob Sanderson, who starts at the end of this month as our Semantic Systems Architect. He's going to help us, and not just with this project: this position was created to coordinate the activities around linked data for the entire Getty, not just the Research Institute but also the Museum, the Conservation Institute, the Foundation, and the Trust at large, so that everything we're doing will be on the same page.

Okay, so let's move on to actually building the new system. The way I see it, there are four main components: the public interface; the staff interface for the staff who enter the new data; the API, which I consider part of the front end as well; and bulk downloads. We held a workshop last year where we invited a bunch of provenance researchers, and bulk download was one of their top requests, so we're going to try to address that as well.

There are some challenges with the user interfaces. They have to accommodate the traditional use case of looking up a single artwork and finding out where it came from, but also the new use cases of computational research and looking at things in the aggregate, so we have to create an interface that lets people do both, and do both well. On top of that, we're now working with a very abstract data model, and we have to make this very abstract thing intuitive in the user interface. But despite those challenges, we can do some pretty cool stuff once we convert the data. It will enable much more dynamic information processing, and we can build data visualizations that I think will be both illuminating and really nice to look at. Being able to connect our data with data elsewhere will also make for a much richer user interface.

As for my design method, I follow the advice of Alan Cooper. His approach is called goal-directed design: you find out what the users' goals are, and you develop a system that enables them to achieve those goals. The main workflow is to first conduct interviews and figure out what these users are doing and what their needs are. You create personas, and from those personas you create scenarios, which are basically imagined use cases. Once you have that, you have your list of requirements, and only then do you begin the real design of what it's going to look like and how it's going to operate.

For computational research methods, the API is the user interface, so we have to do a good job creating a very useful API. The question is whether to put up a SPARQL endpoint or create a REST-based API. SPARQL is a very complex thing, but it is a standard; then again, REST is also a standard; so we might end up building both. The bulk downloads are probably the easiest piece: we just put the data out there. Going back to the SPARQL endpoint and triple stores: triple stores have a reputation for being quite slow, so one thing I've been considering is putting everything into Elasticsearch, a very scalable search engine. You can dump a huge amount of data into it and it is still very fast, and I would build a REST API on top of it with a web framework like Django. And there's a typo on the slide where it says "use a triple store." Right, sorry, that's not a typo: we might also push the data out to a triple store and SPARQL endpoint for people who actually want to get at the data that way.
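To show what that plan could look like, here is a rough sketch of a Django view answering a REST query by delegating to Elasticsearch. The index name, field name, and URL are all invented for illustration; this is a sketch under my current assumptions, not the project's actual API.

```python
from django.http import JsonResponse
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes an Elasticsearch node on localhost

def search_records(request):
    """GET /api/records/?q=goupil  (hypothetical endpoint)."""
    query = request.GET.get("q", "")
    result = es.search(
        index="provenance",  # hypothetical index of converted GPI records
        body={"query": {"match": {"all_text": query}}, "size": 25},
    )
    hits = [hit["_source"] for hit in result["hits"]["hits"]]
    return JsonResponse({"results": hits})
```

The point of splitting the stack this way is that the search engine handles the high-traffic, predictable queries quickly, while the triple store absorbs the open-ended ones.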
That's easier than trying to map researchers' complex queries into the Elasticsearch-based system I'm hoping to use. And we're actually in luck here, because a different arm of the Getty, the Conservation Institute, has already built something similar. They built a platform called Arches that can ingest CRM-based data, and it combines Django and Elasticsearch, which is what I was planning to do anyway. So there's a very strong possibility that we can start from Arches instead, and that would put us way ahead. To give you a quick idea of what it is: Arches was built to catalog immovable cultural heritage, basically sites like Stonehenge. It wasn't built as an end application; it was built as a platform that other organizations doing this type of work can take and build their applications on top of. So this is actually quite nice: we have a platform ready to ingest CRM data, and we can build a new application that sits on top of it to ingest our provenance data.

So, to finish, I'll tell you a little more about what I ended with earlier, which was tying the research side of this project to the technology side. We think it's really important, as we update the GPI's infrastructure, to adapt it to the current needs of the field, but also to let it continue to adapt as those needs change, so that we don't end up back in a situation where we have a thing purpose-built for one use while the field has moved on. The goals of this part of the project are to design UIs that support, and to map the data in ways that support, the queries researchers need and want. In a way, the mapping in this project will never be over; we'll be continually remapping to get it right in terms of what researchers need. We also want to establish methods for tracking changes and updates to the data, and for communicating those changes and updates to users, which is an area where there has been room for improvement, let's say. As editorial decisions are made, for example, we need ways to maintain the provenance of the provenance data, so people can always follow it back to the source document.

We also want to position the Provenance Index as one part of an extensive set of resources, because a researcher will never begin and end with a single provenance record. I'm showing you a Bouguereau from the Getty's collection; a telegram about it, from Duveen to Getty, which is in our institutional archives; the painting is pictured in a book about the Walters collection that's in our special collections; and it's mentioned in a Goupil stock book. You can see how all these things link together and are part of doing provenance research, and we want to strengthen those connections, conceptually certainly, and materially as well if we can.

Some things we're doing to achieve those goals: we are establishing a research advisory committee that includes the provenance data specialist, the postdoc Josh mentioned, and other experts in provenance research, to conduct workshops and symposia and to come up with a content development plan. I mentioned that the Provenance Index reflects the historical biases of art history: it's mostly Western European painting. Should we start adding decorative arts objects? Should we start adding non-Western art? How might we go about doing that? The committee will also advise on the development of data standards.
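On the tracking question from a moment ago, maintaining the provenance of the provenance data, one simple shape for the idea, sketched as a hypothetical Python structure rather than a design we've settled on: the verbatim transcription is never overwritten, and every editorial normalization is logged.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class TrackedValue:
    """A field value that keeps its link back to the source document."""
    source_document: str              # citation for the source, e.g. a catalog number
    transcribed: str                  # verbatim transcription, never overwritten
    normalized: Optional[str] = None  # machine-readable editorial value
    edits: list = field(default_factory=list)

    def normalize(self, value: str, editor: str, note: str) -> None:
        """Record a new normalized value along with who changed it and why."""
        self.normalized = value
        self.edits.append({"date": date.today().isoformat(),
                           "editor": editor, "note": note})

# Hypothetical usage with the AFT.1683 death date from earlier;
# "1683/.." is EDTF notation for an open-ended range starting in 1683.
death_date = TrackedValue(source_document="sale catalog, lot 0352",
                          transcribed="AFT.1683")
death_date.normalize("1683/..", "editor-initials", "opened range per date guidelines")
```

Whatever form it finally takes, the requirement is the same: users can always get from an edited value back to what the source document actually says.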
We also have an extensive publications and communications plan for the project. We're redesigning the website extensively so that people can understand what's in the Provenance Index, maybe with data visualizations that communicate where it's really rich and where there are gaps, and a guide to resources at the GRI so people understand what's in the library versus what the PSCP is doing. We may also do a pilot publication highlighting people doing the art-market kind of research, the data-driven research, and making it transparent how they're doing it and what tools they're using. At this point those researchers are a fairly small group, and a big part of this project is to encourage that kind of research. And lastly, we may create a data registry through which we disseminate standards for how you should model and share your provenance data. We imagine that people will download large data sets, edit the data to fit their own particular research questions, and perhaps want to contribute it back to the community so that others can use that data set. So we're thinking about ways to not just establish data standards but promote them.

Okay, so finally, our timeline. This project is just now getting underway. We're experimenting with data modeling, and really waiting for Rob to start so he can help us out. We've been conducting interviews with various researchers, both the traditional provenance researchers and the more computational art-market researchers, so that we can build personas for both types of people, and we've been playing around with the different tools we might use to build this. That will probably go on through the summer. We'll try to start software development sometime in the fall or winter, start building up documentation about what actually is in the Provenance Index, and start to model some research projects we can do with the new type of data. In about a year or so, we can start testing whatever prototype we have and doing some outreach. And then the big piece is changing the entire workflow of the provenance department, because this is going to completely change the way these people work. Right now they're doing data entry on a record-by-record basis, and now we're converting to these semantic triples; how does that change the way they think about their work? That's going to be a big one. So we're expecting this to be about a three-year project, and we hope to give you updates here maybe once a year, I'm hoping. And that's it. Thank you.