Okay. While my computer is getting situated, thank you all for coming to POD, the Platform for Open Data, or I think we called it Building Data Lakes for Libraries. And my computer is doing the beach ball. There we go. Before we begin, it's a good thing this is a panel, but we have an even larger group that could be up here. There are a number of other members of the POD board who are listed on this slide. And because one of the reasons we wanted to do this was to maximize networking and connections, if you are a member of the POD board in this room, could you please just stand up briefly? We have the room surrounded. One way to make sure your presentations are well attended is just to have a big team. Okay. So my name is Tom Kramer. I am from Stanford University. I'm joined by my colleagues Elizabeth Long from the University of Chicago and Nora Demek from Brown. And we are going to talk to you about the Platform for Open Data. So here's the conclusion, so if you need to run to another presentation you can. Every library project is a data project at this point. There is an enormous amount of friction around pooling data, and it's primarily logistical rather than technical. At this point there is tremendous appetite within consortia to be able to pool their records and their data together to enable innovation and new services. And what we have found in our journey is that data lakes are a powerful approach to meeting these needs. So here's the setup to that conclusion. Libraries like to deliver valuable services. We like to work as networks. We like to collaborate and we like to innovate. Data underlies almost everything that a library does or would like to do. 
So everything from union catalogs and other aggregations to cooperative cataloging, authority control, shared print retention, interlibrary loan, open access and public domain materials, digitization and digital services, identifier services, and more. But if you have ever been involved in one of these, and that's everyone who held up their hand when we asked those questions, it's a pain. Especially when you start to talk details. I was recently in a discussion with a bunch of catalogers and we had the question of, well, what is metadata? I said, oh, God. So: what data are you talking about? What's the transfer mechanism? Is it one-time, which is relatively straightforward, or is it ongoing? That's actually quite expensive. How is it going to be used? What are we going to do for mappings and normalization? And then, of course, what are all the security, contractual, or privacy concerns that come with this? This has to be answered every time and for every institution. So n times, every time you go through a data aggregation. And the result, at least in our experience, is that you have an endless series of one-off and fit-for-purpose data exchanges. There's a pretty high startup cost to activate these pools. Data, as a result, tends to pool in a few hubs, where it is often difficult to extract or reuse, because it was maybe built for one purpose and it's not easy to extend it to additional ones. This really slows down innovation. It actually damps down the appetite for people even thinking about what might be possible, and it really limits collaborations. So Ivy Plus, the IPLC, or Ivy Plus Libraries Confederation, found itself in this position where, actually, this slide says 2019, but I think we've been talking since 2013 about how we pool the data, so it goes back further. The primary instigator and the primary motivator for this was BorrowDirect, which is our interlibrary loan service. It's a fantastic service. 
And what we wanted to do was actually elevate the level of discovery that comes with interlibrary loan by pooling records together so we could do indexing. So the challenge was, everyone recognized the need, everyone recognized the opportunity, and then we spent six years talking about how to do it. POD is ultimately the response that we came up with. And one of the things about POD is we didn't just want to look at a one-off need; we instead had a larger vision for how we'd actually create a platform for data to be reused. So this is our mission, I just said it, it came out so naturally: we want to create a platform that positions data reuse and service integration as strategic initiatives. We wanted to do this in a particular way. We wanted to do it through open and iterative development that leverages the investments in our internal capacities. We did not want to outsource this to a third-party contractor. We didn't want to outsource it to a vendor. We didn't want to have, like, overseas developers who knew how to do it, such that we'd have to contact and contract with them to do any enhancements. By doing these things, we wanted to meet multiple needs within and across Ivy Plus and enable innovation that would not happen with a bespoke or proprietary solution. That makes sense. Who's going to argue against that? Well, we spent six years talking about it, so we did argue about it. To transform that wall of words into something more simple: we wanted to gather data from many institutions. We wanted to pool that for easy reuse. We wanted to be able to enrich the data in ways that we might not even be able to anticipate, and then to deploy that to support not just one but varying needs. The way we wanted to go about that, very deliberately, was to enable innovation, build our own internal capacity, and recognize that data is now a reusable asset. Not just research data but library data, operational data. 
The approach we ended up taking was that of a data lake. If you're familiar with data lakes versus data warehouses, a data warehouse is something that you build that is pre-optimized for specific queries. We all have institutional data warehouses for our financial systems, for example, at our institutions. We wanted to avoid the trap of predetermining what data we needed and filtering out data before we even got it. Instead, we looked at this notion of a data lake: bringing in structured, semi-structured, and unstructured data from a variety of sources. The critical idea here is that you apply schema on read. Pull the data in. Once you have the data, then you can filter the data. You can transform the data. You can undertake whatever transformations you need once you have the data. If you don't have the data, you can't do that. I think this is a conceptual shift, a small shift, but it's really fundamental, because we no longer had to start by discussing what data we wanted to filter out beforehand. Since we started this within Ivy Plus, focused on BorrowDirect, we've actually been gathering use cases for about two or three years now. We've come up with three clusters. There's a whole cluster around resource discovery, access, and sharing. That's both physical and digital access within our network. There's a ton of questions around collection analysis and decision support, and then there's a set of use cases that are just around data innovation and enrichment. I won't go through all of these, but for resource discovery and sharing, we have BorrowDirect or BorrowDirect Digital, something that Ivy Plus is excited to move into: controlled digital lending or electronic document delivery, digital resource sharing, catalog matching, deduplication, and clustering. Within collections analysis, we've done a couple of efforts, and they've been one-off and ad hoc. 
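To make the schema-on-read idea concrete, here is a minimal, hypothetical Python sketch (not POD's actual code; the record shape and the CJK filter are invented for illustration). Ingest stores everything unchanged and filters nothing out; each consumer applies its own schema and filtering only at read time.

```python
# Minimal schema-on-read sketch (hypothetical, not POD's implementation).
# Records land in the lake as raw blobs; structure is applied only when
# a consumer reads them for a specific purpose.
import json

lake = []  # the "lake": raw, schema-free storage


def ingest(raw_record: str) -> None:
    """Ingest keeps everything as-is; no schema, no up-front filtering."""
    lake.append(raw_record)


def read_with_schema(predicate):
    """Each consumer parses and filters on the way out (schema on read)."""
    for raw in lake:
        record = json.loads(raw)  # schema applied at read time
        if predicate(record):
            yield record


ingest('{"title": "Buddhist Art", "script": "CJK"}')
ingest('{"title": "Prairie Flora", "script": "Latin"}')

# A consumer interested only in CJK materials filters at read time;
# a different consumer could apply a completely different view later.
cjk = list(read_with_schema(lambda r: r.get("script") == "CJK"))
print(len(cjk))  # 1
```

The point of the sketch is the ordering: nothing is discarded at ingest, so a use case nobody anticipated can still be served later by a new read-time filter.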
Could we actually make that a recurring thing? There are a lot of questions around shared print retention or collections intelligence: if I want to digitize something, preserve something, or an item's been lost or stolen, how many other copies are there within the network? Then for data innovation and enrichment, we've actually had multiple questions come up. At Penn, Jim Hahn has done some really great R&D work suggesting URIs, basically linked data, for publishers in a cataloging UI that he's developed. From Harvard, there was a PhD research question where they wanted to look at the art and architecture acquisitions holdings for 13 institutions over 20 years. At Duke, there was a question about what Ivy Plus holds in Buddhist studies in CJK, that is, Chinese, Japanese, and Korean scripts. Or, my personal favorite: can we get a list of all of the IIIF-based items in Ivy Plus catalogs? We've also looked at things like whether we could expand into linked data discovery, or just IPLC-wide metadata analysis and quality enhancements. Elizabeth is going to tell you what we've actually built. Yes, I'm going to talk about what is actually there. What we've done so far is build an operational data lake that's running on open source software. We've had that in operation now for a couple of years. It's populated with 231 million records coming from 13 different Ivy Plus institutions. One of the things that's in production, or will be in production in an hour, is going live with POD as the place where data comes from for our ReShare implementation for Ivy Plus. In process is what Tom was just talking about: our growing list of additional use cases and a growing list of prospective partners. We have been approached by a lot of consortia; there's a lot of interest in how they could do something similar. We'll talk later about what that could look like in terms of the growth trajectory for POD. 
Then, looking to the future, out of all of these potential use cases we want to come up with some real second use cases to put into production, and also look at that idea of whether we expand to another consortium, for instance. What POD looks like: we have a website you can go to that is the way that providers can actually look at and understand their data, and consumers can actually look at what is there as well. As Tom was saying, right now what is in there is bibliographic data, item data, and holdings data, because our first use case was interlibrary loan, so that was the kind of data that we put in. We have various ways in which you can get that data into POD, with support for loading full data, and we can now also deal with incremental changes to that data; you can get out full data dumps or get out the incremental changes. We have all of this in a GitHub repository, so you can take a look at it; as I said, it's open source. This is what the dashboard actually looks like. You can see the data streams that we have. We have statistics provided about how many records you have and how many of them are unique. We are looking at that across the full set of data. Then we now have a lot of institutional data that you can explore to understand who is responsible for this data. There is summary information, and there is a lot about your processing status. The dashboard is very useful for those who are doing the contributing; they can understand where their data is. It's also useful for a person who wants to consume it: they can understand who they might contact, whether there is new data coming, and what state it is in. Then there is a level that's really for the people who are doing this work that lets them understand more. I think in the last presentation we were talking about that idea of turnover and such. 
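The full-dump-plus-incrementals pattern just described could be sketched roughly like this in Python (a hedged illustration only; the record shape, `id` field, and `deleted` flag are invented for this example and are not POD's actual API or data model). A consumer takes a full dump as its baseline, then replays incremental changes in order, with later updates and deletes winning.

```python
# Hypothetical sketch of consuming a full dump plus incremental updates.
# Field names ("id", "deleted") are illustrative, not POD's real format.

def apply_increments(full_dump, increments):
    """Start from a full dump, then replay incremental changes in order."""
    current = {rec["id"]: rec for rec in full_dump}
    for change in increments:
        if change.get("deleted"):
            current.pop(change["id"], None)  # a delete removes the record
        else:
            current[change["id"]] = change   # an update/add replaces it
    return current


full = [{"id": "a1", "title": "Old title"},
        {"id": "b2", "title": "Kept for now"}]
incr = [{"id": "a1", "title": "New title"},   # update
        {"id": "c3", "title": "Added"},       # add
        {"id": "b2", "deleted": True}]        # delete

state = apply_increments(full, incr)
print(sorted(state))  # ['a1', 'c3']
```

The design choice this illustrates is why incremental support matters: consumers can stay current without re-pulling the full dump, which the talk notes is the expensive, ongoing part of data exchange.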
One of the things that is built behind this is making it easy for people to understand who has the secure access to be able to load data, et cetera. We also have a certain amount of data profiling tools. These do a certain amount of profiling of the data itself, so you can look at it and actually understand some things about your data, which has proved really interesting and helpful, also for people troubleshooting and trying to understand what they're seeing when something doesn't come out looking the way they're expecting it to. One of the things we have built but not put in production: several of the different Ivy Plus institutions were interested in the idea of how we might pull this data into our catalogs and make it available. So we had several pilot demos, proofs of concept, of that. None of us have put that in production yet, but there was a real interest in understanding what it would look like to have this many records available in our catalogs. This is an example of what Penn built, but they were not the only ones who did that. And then, as I said, we are going live with POD as the data lake for our ReShare implementation starting at 11 o'clock today. This is live information for you. So this gives you a sense of how that works. One of the things that really helped was that rather than ReShare having to deal with each of our institutions to get that data flow happening and troubleshoot with each of us, we put all of our data in POD, which it already was, and then ReShare worked with POD to get data out of that. So it really simplified that process. It actually really helped speed up the implementation schedule for IPLC going live with ReShare, because it became a one-stop process to troubleshoot when there were problems, rather than trying to do that with all the institutional groups. 
And then this slide is also showing the potential that we have, not yet in production, to do all kinds of other things with that same data lake. So with that, I'm going to hand it over to Nora to talk about what's next. Well, looking forward, we did a thing, right? We built a data lake, and so now we're actively exploring what we are going to do with this. What does this mean? What are new use cases? Tom talked about some use cases, but there's a myriad of new opportunities that this data lake provides. Can we add transactional data, so that we can inform projects like shared print retention? Surely that's an important goal for the academy and for continued success in resource sharing. Can we add more data? More institutions, to support broader cross-institutional collaboration? One thing we realize is that we still have work to do in establishing what kind of entity POD is. It clearly has potential beyond being a data store, but we all agree it isn't going to be a software project, right? What are the most compelling and interesting use cases for POD? Right now it is a live backbone for Ivy Plus's BorrowDirect, well, as of 11 o'clock. It can be a core data service for other small and medium institutions, other resource sharing consortia, or partnerships. These are all organizing approaches that we hope to test in the coming year. As you can see from this chart, we can group these initiatives into four buckets. Each one represents an exciting new possibility for enriching discovery, accelerating the pace of scholarship, and contributing to an ethos of stewardship across the Ivy Plus libraries and beyond. And I speak for the team when I say that we're all really excited by these possibilities. I think one thing that's extraordinary about our partnership in POD is that as technology evolves and we build our own less and less, this gives us a way to still be techy. So I think that's super important. 
Do we continue to grow the data lake? Or do we replicate it? Do we segment it, cluster it, de-dupe it, depending on the use cases? There are implications for sustainability, support, and longevity, so we'll need to remain flexible in our approach. So what are our top priorities? We're working with Share-VDE to see what's possible with linked open data; this is a really exciting possibility for discovery. The value of linked open data in a shared discovery environment presents a lot of new opportunities for us, and Share-VDE is a leader in that. Open metadata is another priority. As libraries committed to supporting the open scholarship landscape, what we have the potential to do with shared open metadata is to support other interesting projects across libraries. One thing that's really interesting to me is how the data can inform our institutional strategies to examine our collections with a critical lens. How can we become institutions that really embed the values of diversity, equity, and inclusion throughout our practice? This data store has the potential to expose the gaps in the scholarly record from which a great deal of the academy is built. Can we examine the collections through a decolonizing or anti-racist lens that holds us accountable to doing better? Can we use the data for reparative cataloging at scale? We have a long way to go to figure out how we have perpetuated ideas that are harmful. We need to know ourselves before we can remediate that data. And more practically, how much does this thing cost, and who are the players? How do we keep it going? Tom talked about some exciting use cases. There are more, and a number of them are very compelling. Data mining is a great example, like the Harvard example of looking at art and architecture across the data lake. The 583 field is where we put our print retention notes. That's really critical to doing any projects around shared print retention across Ivy Plus or larger. 
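As an illustrative sketch of what mining 583 retention notes out of the pool could look like (assuming records already parsed into simple dicts; this shape is invented for the example and is not POD's actual record format), one could scan every record for 583 fields and report which institution holds each commitment:

```python
# Hypothetical sketch: collect MARC 583 (Action Note) fields, which carry
# shared-print retention commitments, from a pool of parsed records.
# The dict shape here is illustrative, not POD's actual data model.

def retention_notes(records):
    """Yield (institution, note) pairs for every 583 field found."""
    for rec in records:
        for field in rec.get("fields", []):
            if field.get("tag") == "583":
                yield rec["institution"], field.get("a", "")


pool = [
    {"institution": "Brown",
     "fields": [{"tag": "583", "a": "committed to retain 2042"}]},
    {"institution": "Chicago",
     "fields": [{"tag": "245", "a": "A title with no retention note"}]},
]

notes = list(retention_notes(pool))
print(notes)  # [('Brown', 'committed to retain 2042')]
```

A real version would parse MARC (for example with a library such as pymarc) and aggregate by title cluster, but the shape of the analysis, scan everything, filter on read, is the same schema-on-read pattern the talk describes.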
An analysis could address our notes, particularly around conservation, preservation, even marginalia. Just really interesting opportunities. Obviously the data is essential to collection analysis. Feeding slices to a service like Gold Rush can help us see a slice in time, or a snapshot, so that we can improve our practice around collection development, and it's much more accessible than a gigantic data lake, so it's a little slice. And I think there are lots and lots of other really exciting things that we can do with this. So some of the things that the POD community is talking about are expansion models, copyright, DEI initiatives and analysis, as I mentioned before, and learning about the new ways that we can inform bibliographic knowledge infrastructure using tools like this, so that we can critically examine our past and our future. And I would like to say thank you to all my colleagues and thank you to you all for listening today. Questions or discussion? Hey guys, Christina Drummond. So we've talked about the human layer, we've talked about the technological layer. I'm curious, especially for POD, if you ran into any issues with the legal layer, if you had any issues around controlled access or limitation of reuse, or if you're truly just working with open data so that didn't apply. Want me to take that? So, no, it is not open data at the moment. One of the things we've done is try to really balance having a somewhat lightweight kind of governance of the data and yet at the same time, you know, facilitate making sure that people feel comfortable about what's happening with their data. So at the moment the data is only available to those who are part of IPLC. And that is something that, as we talked about, we're very interested in taking further: thinking about what of our data could be open, and how to then make it open so that it becomes a platform for using that. 
But right now we have a data use agreement that everyone who's contributed data has signed, and that controls what can be done with that data. And we have not consulted, I don't think, with our legal counsel for any of it; I mean, individual institutions may have, because they decided a lot. Not all of us necessarily contributed every single one of our bibliographic records, because we aren't able to share everything. I know for us at Chicago, we did not share everything because of legal constraints. Great. Hi. So I'm really excited about this project. I wanted to understand a little bit more about the costs involved. I think it's great that you're thinking about the future sustainability model, but the costs of actually hosting the data and querying the data and making all of that happen: how is it currently funded, and how has it been funded in the past up until now? So, we're talking massive data. It's almost 300 gigabytes. Which is a joke; it's not really massive. Okay, this thing is on. We don't know what it costs yet. I mean, we can characterize it in terms of there having been two development work cycles, with a mixed team from multiple institutions that has worked on it over the last three years. We're currently running it on a set of VMs. It's not actually using that much storage space, so we could characterize the cost that way. Ivy Plus, because we got consensus that this was a strategic asset, has kind of crowdsourced this right now. It is running on servers at Stanford, but with all the data that's pulled in. 
Because we didn't know what it would take at the beginning to run it, we have kind of a two-year honeymoon phase, or assessment phase, where we're trying to actually meter the cost and come up with some sustainable cost model. It's everyone's expectation, within Ivy Plus and at Stanford, where we're doing the bulk of the operations, that we'll come up with some way that it will fund itself, and if it needed to move from Stanford it could go, you know, into the cloud or to a different institution and it would still be funded. So this is a case study right now. Sam Graham, Georgia Institute of Technology. Great project. A couple of questions around governance: how do you handle governance around data access and what data actually can be housed within the data lake? And within that governance, how do you ensure that the data lake doesn't become a data swamp? So, I think Elizabeth described this; like the DCN talked about, it's trust. For all of its foibles, Ivy Plus is a confederation that has worked with itself for a long time, and there's a high level of understanding and trust. We know what our strengths and weaknesses are, and our joint directions. So without getting into a super forensic cleaning and structuring of all the data, we felt confident as a group that we can pool our data together; we already have a history of doing this in multiple ways. We can pool our data together, and Ivy Plus can get it: if you add data to the Ivy Plus pool, you can take data from the Ivy Plus pool. So that's how we're navigating it right now, and it's a fairly simple rule: don't share what you can't, and what you do share, make sure that you're comfortable with it. If you're taking data from the pool, don't do things that you can't do, and you sort of know it when you see it. It's that old Supreme Court test, and we're feeling our way as we go. I do think this will get more interesting, and one of the 
reasons I think this works is that there are lots of consortia whose members know each other and can trust each other, and typically those are the ones you want to collaborate with. So for us, POD is a great enabling tool for consortia to work with each other, rather than a free-for-all of all the metadata in the world, or something like that. Thank you. Um, have you thought about inter-consortial collaboration within POD? I was going to say, that's certainly one of the things we've most recently been talking about, because we are talking about expanding to a second consortium where one of our members is, you know, in both consortia, and that's raised a lot of questions. Right now we've talked about what that might mean functionally: we could add a lot of data and then just trust this consortium to only access its data and that consortium to only access its data, but I think one of the things we've said is that this is driving us toward having an ability within POD to segment it, so that you don't provide that access when you've added a different group. One other thing I would add about the approach we took with our current data use policy is that we put the onus on the user who would be consuming. We have the general guideline, but we don't have lots of structures and overview and "yes, we're going to review it and make sure that we agree with your use." I think this comes from a trust environment: we've come from an expectation that if someone abuses it, we'll then figure out what we do about that fact, but we're not going to build a big complicated infrastructure to ensure that that couldn't possibly happen. And I think that comes from that idea of a trust environment in which we believe that people will be good actors and that we will have a process if we discover that they're not. That does raise a lot of questions as you start to scale and think about, you know, how long we can go down that road, 
but it's allowed us to move forward without being bogged down by building lots of governance infrastructure before we even had something in operation. Thank you. Also, this isn't health data; we started with the really easy thing. These are records that are already available in everyone's public catalog. I think if we ever get to the point of transactional data, we'll have to look at this much more seriously. One really nice thing about POD is we figured out we can make incremental progress and provide value without building a complete solution that meets all needs for everything. So I think we're in a good position to take another increment: either add more data types, add more institutions, or add more functionality. Being agile and iterative is one of the project values. And people can always take their own data, or data out of the POD, add their transactional data at their own institutions, and do those analyses there; they don't have to be done centrally. So, we are three minutes over time. I think any of the members of the POD board would be happy to talk. The software is open source; it is in GitHub, so if anyone wants to use it, that is great, you don't need to talk to us. But if you're interested in this idea, we would love to talk to you, because we think there's some real promise in this direction. As Nora said, we think this is pretty simple, almost stupid, right? We gathered 300 gigabytes of data, and yet somehow it has been transformational within Ivy Plus. So please come talk to us if you're interested.