So, online scientific reference sample collections, and how we've been working with other colleagues to actually integrate different initiatives that are underway, as opposed to us all reinventing the wheel. So in terms of what's really behind this: with the advances in technology, and particularly in scientific analyses, we're generating large volumes of data, and how do we actually capture this data and make it available? At the Library of Congress we're in the public domain, but I realised that most of our data from the scientific analyses of the collection resides on individual non-networked computers and isn't easily accessible. It's also not connected back to the same original object. So these are big challenges we have. Because we're in cultural heritage, we're also multidisciplinary, and we're working with people from archaeology, from chemistry, from physics, from the humanities. All of these different types of data need to be integrated together, and these are some of the huge challenges we continue to face. One of the challenges we have with the scientific data is the complex types of data sets that we're creating. And so while we think that it might be more simplistic... I'm sorry, I've gone too far. Try again. I'm looking at the wrong side of my screen. I've got a dual screen, sorry guys. So one of the challenges for heritage science data sets is the volume, and we've been working and trying to really engage on a global scale with our European colleagues. And this I think is a critical component, because we can't work alone.
What's been very interesting is looking at what they're doing in terms of research infrastructures throughout Italy and Europe and really engaging in collaborative ventures, as opposed to infrastructures here in the States, where we really aren't engaging as effectively. And that's really a challenge for us: to try and get the data together and work together effectively. Standardised protocols are also a challenge when you're talking to scientists, because everyone wants to do their own procedure, which is actually perfectly fine as long as you capture all the metadata of how you do it and make that available so someone else can access it. So my scientists were my initial guinea pigs, and there was some really good initial resistance, because scientists don't like being put in a box. But it got turned around last year when we were writing a grant, and one of my scientists came to me and said, you know that stuff about structured data and the metadata? Can I take that please? Because I'm writing this grant up. So it's interesting how people start to think about what they need when they need that information. As I jumped ahead before, we're working with a wide range of materials, and each of these materials requires not just one but multiple types of analyses. We actually look at things over time. We retest them. How do we look at all these different components, linking in environmental data and how that impacts treatments? And I've got some examples later on just to show you the types of data that these deal with. And so those temporal, spatial and spectral components really give you another layer of complexity. I also want to note that these data challenges are the same for all disciplines. We tend to think that we're isolated and alone when we're looking at cultural heritage, and it's not the case.
The same challenges of how we can get sustained access, open source file formats, the capacity for linked data, having a more integrated approach so that these things are really, truly linked. And looking at the fact that, as I said, we have a lot of related disciplines, we work with all of them within cultural heritage, and how do we actually bring those different types of approaches together and move forward? So one of the groups that I've been working with a lot recently is the Research Data Alliance. And this really is the face of the USA in terms of the global component. I must admit that it was very daunting when I first joined, because there were 72 interest groups, and someone said, create a new one. I'm like, why should I create a new one? There must be at least five that are already doing things that I want to do. And then really looking at what was the shared data structure and terminology across the different disciplines. When I went to one of the chemistry interest groups, they said to me, well, you're cultural heritage, so you have your own terminology. And I said, actually no. We're chemists and we're physicists and we're archaeologists. We actually want to make sure that whatever we're using, that terminology is the same across disciplines, so people really understand what we're talking about when we talk about cultural heritage. And so we're really trying to link those vocabularies together and have better engagement. So the classic linked open data: how do we really do this? A really nice example was from that same chemistry working group, when they started talking about the International Union of Pure and Applied Chemistry, the IUPAC group, and its Gold Books and Green Books. This is a series of books that have all of these different types of data within them, a very extensive level of data, anywhere from the molecular formula of iron chloride to how we define illuminance in terms of light, all sorts of terms that we use a lot.
But why should we reinvent those terms when they already exist and we can now link to the original terms from the IUPAC books? So what they are actually looking at is trying to get those out as truly linked data and make them accessible for people to really connect with. So that's something I keep trying to do: I don't want to create new terms when they already exist in other initiatives. So the other group who works very closely with RDA is the National Institute of Standards and Technology. And this seemed to me to be the absolute first place to go when you start talking about standard reference materials. Bob Hanisch, who was one of the co-authors for this, is with the Office of Data and Informatics, and this is just sort of an idea of the types of areas they work in, in terms of the standard reference data, the research data, the data science and then the different communities. And what they're really looking at, from those different areas, is the discovery component, the access component and then how do you interoperate? And this was really helpful for me in looking at how we were trying to do this and engaging with the setups and structures that they had already developed: their repositories, the types of data they'd captured, their persistent identifiers, and how they were actually curating this data. In terms of what they actually have there, you have the public data access policy and then a wide range of standard reference data and samples. And what I really just want to drop into is one of those data sets, materialsdata.nist.gov. This is actually one of the data sets that's got an extremely extensive volume of data within it, and also, within that, a wide range of different communities that actually present and contribute the information.
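The point about linking to existing terms rather than recreating them can be sketched as a tiny SKOS-style vocabulary crosswalk. This is a minimal illustration, not any institution's actual schema, and the Gold Book URLs below are placeholders rather than verified entry IDs.

```python
# Minimal sketch of a vocabulary crosswalk: local heritage-science terms
# mapped to external authority IRIs via skos:exactMatch, emitted as
# N-Triples so any linked-data tool can consume them.
# The goldbook.iupac.org entry IDs below are placeholders, not real ones.

SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

def crosswalk_to_ntriples(local_base, mappings):
    """Render {local_term: external_iri} pairs as N-Triples lines."""
    lines = []
    for term, external_iri in sorted(mappings.items()):
        subject = f"<{local_base}{term}>"
        lines.append(f"{subject} <{SKOS_EXACT_MATCH}> <{external_iri}> .")
    return lines

mappings = {
    "illuminance": "https://goldbook.iupac.org/terms/view/EXAMPLE1",   # placeholder
    "iron-chloride": "https://goldbook.iupac.org/terms/view/EXAMPLE2", # placeholder
}
triples = crosswalk_to_ntriples("https://example.org/hs-vocab/", mappings)
```

Because the output is plain N-Triples, the same mapping file can be loaded into any triple store alongside other vocabularies.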
So you can see here all the different communities within that, and if you go in and look at an actual sample entry for one of those, you have this sort of data: you have these assessments, and within it you have digital identifiers, you have related work and similar work. So they're really starting to link together the type of information they've got there. And again, here you've got the materials resource registry, where you can do this huge search across their different databases for types of materials. So coming back to what we were trying to do with the online scientific reference data: really starting to look at how we get the sharing component going between institutions. There have been some great talks already today about this, with a wonderful focus on collaboration and the lack of resources, and how do we start to really engage in this? And so what I really wanted to look at was how we actually gain greater credence for the type of work that we're doing, for those scientists in the room. When you talk about cultural heritage scientists to other pure scientists, they're like, yeah, but you're not doing real science. And then they walk into the labs and they see some of the issues we're working on, how we actually address the deterioration of anything from glass to paper to pigments, and they're like, you're doing real science. And so that's why there's importance in having that shared terminology and engaging through it. It's really, really important. And the capability to link the analyses from different types of instrumentation back to the one source sample is really a critical one. So when I started talking about this, Bob Hanisch from NIST said, do you know Kerstin Lehnert, from Columbia University? And so this is where we really started to pull some of these initiatives together. Kerstin Lehnert is working on the International Geo Sample Number (IGSN) database.
And this I think is a really fantastic initiative, because essentially the IGSN can be integrated into DOI metadata. It actually tags the individual sample being used for analyses and then links that through any publication. So what we have here is persistent identifiers for specimens, sampling features and collections, and we're looking at how we can reuse this structure for different types of databases. So it's resolving virtual sample representations, and currently there's an international nonprofit organisation that's bringing that all together. In terms of how it works, it's really facilitating internet-based discovery and access of the physical samples for reuse and reproducibility. And it aids the identification of the samples within the literature. I think that's really critical, particularly in this day and age, where we're putting out lots and lots of papers and then suddenly someone has to retract a paper because someone tries to reproduce their data. Well, this way you can actually go back to the absolute original sample and verify that data. So in terms of data authentication, this is a really critical, fantastic component for moving forward. And IGSN has also been adopted by publishers, so they really love the fact that they can link back to the original samples and prove the authenticity of the data. Through this, essentially, you're linking the samples, the data and the publications through this absolute barcoding and identification of those samples, and IGSN is now even in DataCite. So moving back to what we're trying to do in research and testing: how do we really develop a sustainable, integrated heritage science data model? We're seeing a lot of other initiatives, we're doing parts of it, but how do we start to take it a little bit further and really link the components together?
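The sample-to-data-to-publication linkage that IGSN enables can be sketched as a tiny in-memory registry. The identifiers below are invented examples, not registered IGSNs or DOIs, and this is an illustration of the linking idea, not the IGSN system's actual API.

```python
# Sketch of the linkage IGSN-style persistent identifiers make possible:
# one physical sample, tagged once, with every dataset and publication
# that touches it linked back to that single identifier.
# All identifiers here are invented examples.

registry = {}

def register_sample(igsn, description):
    registry[igsn] = {"description": description, "datasets": [], "publications": []}

def link_dataset(igsn, doi):
    registry[igsn]["datasets"].append(doi)

def link_publication(igsn, doi):
    registry[igsn]["publications"].append(doi)

def resolve(igsn):
    """Return everything linked to one physical sample."""
    return registry[igsn]

register_sample("IGSN:XX1234567", "parchment reference sample, sheet 3")
link_dataset("IGSN:XX1234567", "doi:10.0000/example-xrf-dataset")
link_publication("IGSN:XX1234567", "doi:10.0000/example-paper")
```

Resolving the one identifier then returns the sample description plus every linked dataset and paper, which is exactly the verification path described for retraction disputes: from publication back to the original physical sample.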
One of the things I think is really interesting is that the critical elements for an open federated database are data access and interoperability. We really have to focus on that. One other thing I wanted to mention is a true understanding of open access, because I think we often use the term very loosely, and it's not until you get to the end of a project and start to say it's open access that a lot of people suddenly realise they haven't put everything into place to meet the requirements needed to make it truly open. And so turning that around, really thinking about it at the beginning of a project, is something we need to focus on. This is one of those lovely scary diagrams that you hope you'll never see, from Joe Padfield at the National Gallery, London, who is really starting to look at this. I wanted you to see the complexity of it: we have the object, the artist, where it was created, who the group was, what time it was created, did it move somewhere else? And then, what analyses did we do on this, and how are they all linked together? This, I think, is a really interesting structure that makes most of us want to scream running from the room, but how do we translate it into something useful? So I realised early on that I was trying to do too much too soon, to create everything in a very short time period, and it doesn't work that way. So I pulled back and said, okay, let's have a phased approach. The first part of this was actually just to start with our reference collection and fully catalogue and describe all of those reference samples. I should say that these are not actual library collection items.
These are physical scientific reference samples that replicate all of the materials we have in our library, whether it's elements, colourants, parchment, storage media, all of those types of materials. The next phase is when we start to integrate all of the different types of analyses back to that original sample, with the intent that once we have all that structured, we can start to link it back to the actual library collection objects. So it first started with this very simple relational database structure for cataloguing the actual samples. And then the challenge became how we really think about the types of collections and the types of analyses we're doing. So I'm just going to give a couple of examples of how complex the types of data can be. Looking at large data sets: with spectral imaging, where we're taking multiple captures or images of the same object right across the spectrum, you can really easily create large data sets. So how do you make those easily accessible, and then link and understand the different types of software? Because it's the end of the day, we're going to have a series of images just for you to look at and relax before we move on to the next phase of the evening. So this is basically what we do. We image right from the ultraviolet through the visible into the infrared, and within that data set we actually embed the metadata of how it was captured within every file. We worked with the company to actually structure this. From that, we literally get this cube of data, and you see, as we go through, things appear and disappear, and we're really looking at that to understand the nuances of the information within our collections. We can then process this in a number of different ways to get a wide range of information.
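The cube of data described above can be sketched in a few lines: one greyscale image per wavelength band, stacked so that every pixel carries a full spectrum. The band wavelengths and pixel values here are invented toy numbers, purely to show the shape of the data.

```python
# Toy sketch of a spectral image cube: one image per wavelength band,
# stacked so each pixel position carries a full spectrum. A real capture
# runs from the UV through the visible into the IR, with the capture
# metadata embedded in every file; the values below are invented.

wavelengths_nm = [365, 450, 530, 625, 735, 850, 940]  # example band centres

def make_cube(bands, height, width, fill=0):
    """cube[b][y][x] = intensity of band b at pixel (y, x)."""
    return [[[fill] * width for _ in range(height)] for _ in range(bands)]

def pixel_spectrum(cube, y, x):
    """The spectrum of one pixel: its value in every band."""
    return [band[y][x] for band in cube]

cube = make_cube(len(wavelengths_nm), 4, 4)
cube[2][1][1] = 200  # a bright response at 530 nm for pixel (1, 1)
spectrum = pixel_spectrum(cube, 1, 1)
```

Slicing the stack one way gives the per-band images where features "appear and disappear"; slicing the other way gives the per-pixel spectra used for processing.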
So if we start with this, not very large given some of our collection items, a 24 by 36 inch map: if we just image the front and the back and stitch it together at around 600 dpi high resolution, we get about six gigabytes per side, so about 12 gigabytes of data. But then the fun stuff is when we actually do the processing. So then we go in closer. In fact, our system can go up to about 6,000 dpi, which is really great fun. But even if we just do high resolution on some of the areas for the curators, we're now up to about 24 or 30 gigabytes. Then we start processing it to look at some of the different types of information. Okay, we're increasing the volume, so how do we keep making it accessible? The other thing that's interesting is that we're really interested in how treatments impact these materials as well, and how we're actually protecting our collections. So now we're looking at something before and after, just with a colour image. But then, when we start to look at some of the processing, at how these things are reacting to different treatments, we're really starting to get a wide range of different types of data, from spectra to images. This is actually a scanning electron microscope image of some fibres. So you have the morphology of the fibres, and then we also capture the information extracted from the instrument about how we captured that. We look at what types of elements are present across those different fibres, we map them, and we also record how we captured that. So from what might look like one simple analysis, we've now got five or six different types of data. And all of those are important for us to understand, and to be able to interact and engage with people at other institutions, to think about what they might want and how they might better use or reuse our data. So, data capture challenges, because they're always there.
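The gigabyte figures above can be checked with a back-of-the-envelope calculation. The bit depth and band count below are illustrative assumptions, not the speaker's exact capture settings; the point is only how quickly area, resolution and band count multiply.

```python
# Back-of-the-envelope sketch of why spectral imaging data grows so fast.
# Bit depth (16-bit) and band count (16) are assumed for illustration.

def raw_size_gb(width_in, height_in, dpi, bands, bytes_per_sample):
    """Raw (uncompressed) size of one capture, in gigabytes."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bands * bytes_per_sample / 1e9

# A 24 x 36 inch map, one side, at 600 dpi, 16 bands, 16 bits per sample:
one_side = raw_size_gb(24, 36, 600, bands=16, bytes_per_sample=2)
both_sides = 2 * one_side   # front and back, roughly the quoted ~12 GB range
```

Raising the resolution to 6,000 dpi multiplies the pixel count a hundredfold, which is why even selective high-resolution detail captures push the totals toward tens of gigabytes per object.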
How do we expand the data model, and this is what we've been looking at, to incorporate instrumentation and temporal components? How do things change over time? That's something Alberto and I are working on at the moment. How do we capture all of the instrument and analytical information and the metadata, rather than relying on that handwritten lab notebook, which is what has been happening for years? And that's been complicated a lot by proprietary file formats. Over a period of about five years of working with manufacturers, they have finally realised that we don't really give a toss about what their file format is. We just want to extract the data so that we can show people how we capture that information. And then, what's an appropriate format for a viewing platform? For example, for those spectral images, embedding JPEGs, because they're much smaller for viewing, and then linking to a larger database for people to access the high-resolution files. So we have this schema for including the scientific analyses, and then we're starting to look at the relationships between those different components, from the research project to the types of files that we're capturing, the instruments, and then how we're ageing these over time. And just to give you an example: we have the entire Barrow Collection, which has been barcoded; we have every different type of magnetic tape that's been created; an original parchment, in our top right, that was created for me in Israel and shipped, and really didn't smell very good when it first arrived; through to pigments and then paint-outs. These are all the different types of reference samples, with challenges, that we're looking at. With the collaboration components, we really wanted to minimise the effort for other institutions to engage with us. So part of this was seeing how we could actually script and bulk upload other people's reference collections, to ingest them into this.
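A bulk-ingest script of the kind described can be sketched as a simple CSV-in, validated-records-out pipeline. The column names here are an assumed exchange format for illustration, not a published specification.

```python
import csv
import io

# Sketch of a bulk-ingest step for a partner institution's reference
# collection: parse a CSV export, keep rows with the required fields,
# and set aside incomplete rows for follow-up. The column names are an
# assumed exchange format, purely illustrative.

REQUIRED = ("sample_id", "material", "institution")

def ingest(csv_text):
    """Split CSV rows into accepted sample records and rejected rows."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all(row.get(field) for field in REQUIRED):
            accepted.append(row)
        else:
            rejected.append(row)   # missing a required field
    return accepted, rejected

sample_csv = """sample_id,material,institution
P-001,parchment,Example Library
,iron gall ink,Example Library
"""
accepted, rejected = ingest(sample_csv)
```

Keeping the rejects rather than silently dropping them matters here: the whole point of minimising effort for partners is telling them exactly which rows need fixing rather than failing the whole upload.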
And we've had a lot of interest from other colleagues in doing this. And then there's sustainability: of course we have to really carefully structure this and create very high standards, so that people really understand how we're capturing the data, and link the protocols and data specifications to each type of analysis. And that comes back to one of my early points, which is that we don't need everyone to use the same exact capture method; we just need to know how they captured it, and to extract that metadata to better share it. And I do mention institutional IT policies, because it is very particular to the institution what types of platforms are available and how those are supported and changed over time. So how do we then, from a visual perspective, look at and engage with this type of data? The term that I've created here is scripto-spatial. You literally have a visual rendering of the object and link all of the data to it. So this data visualisation interface is essentially using an object-oriented approach to be the front face of the underlying database. This is a way to try and engage more effectively with the humanities, to help them see and understand that scientific data is actually really important in terms of the content, and that it can be very useful for their understanding and utilisation of collections. So I have two examples here I just want to run through very quickly. The Waldseemüller 1507 World Map, the first map to refer to America, and this is the reference to America down there. The two central sheets originally had red grid lines that had faded. With the processing, we could actually recreate and bring back these red grid lines, and with a little more processing you can really see where they start and stop. So this is the type of information that is of a lot of interest to a cartographer. We can then also look at the watermarks in terms of provenance, and I'll be coming back to that in a minute.
And then, with specialised image processing, we can actually create a combination of raking light to look at what the original woodblock might have looked like. This has been really interesting for looking at some of the early incunabula, and really thinking about what the printing techniques were and whether that made sense from the tool marks we're seeing at the time. Through to some of my other colleagues, who are saying, well, this is what the world looks like today, and that's what they drew, which is pretty darn close. So who were their sources for creating this map? Obviously someone had travelled around there. How did they know this? And then this is just a rendering, but you're literally taking one of those sheets and starting to add some of those analyses: you're looking at what sort of elements we're seeing in parts of the ink. Are we also seeing fibre particles and ink particles? When we look at different renditions, we can separate out the printing ink and the annotation ink on the side, and then we can take this further with that processing and really look at what it looks like when we go in closer, to really see that depth of processing. What's also interesting is that you can take this one step further and really start to link documents through a geospatial image data set. So here we've got a Ptolemy Geographia from 1513, which we were looking at from a preservation perspective, trying to understand what was happening with some of the maps that were not in very good condition. One interesting component was that there were three different types of paper, two with watermarks, one without. And as I started looking at this, I was like, hmm, one of these I've seen before, and this was the crown watermark, which also appears on the Waldseemüller.
So now, all of a sudden, we've actually linked two collection items from two different special collections that before no one had any idea had any connection, which I think is a really exciting way of bringing our collections alive. To cut a really long story short, we researched this and found out that this group originally got together in Saint-Dié, France, around about 1506, to print a new edition of the Geographia. They then got funding, and, as one does, you stop one project and go on to another: they paused that project to print the Waldseemüller map. Then their sponsors unexpectedly died, and by now they had used up a lot of the good quality paper. They then regrouped in Strasbourg and finished the printing of the Ptolemy. So this is how we got the good and bad quality papers in the Ptolemy Geographia. So that's sort of gone from hardcore database to how we visualise it and bring it all together. But I wanted you to see that it's just a really interesting way of looking at our collections. And if we can't actually make this scientific data available in usable ways, then a lot of it is just never being reused and reproduced. So just in conclusion, I wanted to say that, as you can see, these national and international collaborations are really a critical component of building these online data structures. In two days' time, I'll actually be at the Research Data Alliance plenary in Barcelona with some of my Italian and European colleagues. We'll be presenting the European Research Infrastructure for Heritage Science, E-RIHS, to look at how we can engage more from a global perspective and really pull in more partners. There's an urgent need, I think, for standardised, open access, shared heritage science data, and hopefully we're starting to move towards that.
And we're hoping that the Class D initiative will help people from small institutions to engage, put their data online and work together. So thank you all for listening.