My name's Richard Birkitt and I'm here to do a bit of a talk. It's not one I've given very frequently before, so it's probably new to you, which I hope is the case. I'm going to be looking at unlocking the power of large language models: enhancing user engagement by leveraging the data graph and linked data. Speaking with David a few minutes ago, AI seems to be pervasive within the topics you're covering today, so it's good that I've got something which may be complementary to that. I'm not going to take the full hour, which means you'll almost certainly have time for questions, and if you wish, you can follow up by emailing me as well. And yeah, welcome to day two; you've got those snacks ready. So I'm going to begin really, if my computer will work, by saying that there's nothing new about this: people go into the library probably less than we would have liked. We know from work conducted in this case by OhioLINK that library users comparatively rarely begin in the library, and there are many reasons for this. But the usage of resources, and the non-library start points that are consistently being leveraged, indicate that it's not that users disfavor the library; rather, the library is a missing piece of the information landscape, and a really important one. It's the library collection, as represented within the library, that people would really like access to, but they choose to follow the path of least resistance. This has been studied in many cases, and not to start on something negative, but I want to put the problem statement in front of people who really already know it. Now, the way forward is, I think, becoming more widely recognized, and in the past 18 months or so we've seen some fairly clear trends within the library world.
And I wanted to cover four of these. I think they're probably the big four that other people will be familiar with. The first one is actually the one I'm not gonna talk about quite so much right now: the Library of Congress and their move towards a linked data future. I'm gonna cover this as a case study in a few minutes, and it's certainly worth spending some time looking at. I think we also see the leaders within the library industry coming out with proprietary tools that use graph technology. Sometimes these are original graphs and sometimes they're connecting existing graphs together, and I'm sure you've seen much of what I mention here. Then we've got the very well recognized generative AI. That's not new, but we have certainly seen the rapid adoption of generative AI in the past 18 months in the public space. And what's happening with tools like ChatGPT and others is, I think, aligned quite a lot with what's going on within FOLIO, in that both are taking a problem that's hard and trying to make it as easy as possible for people to use. And then fourth, we have the original problem statement that I spoke about in the last slide: users don't begin within the library, but on the open web. So we need to talk about sharing data outside of the library in order to engage the users who know that libraries are where the highest quality information is housed. In addition to these primary issues that we've been seeing over the past decade or more, we see that large language models base their training content on what comes from the web. And if authoritative data, and by that I mean library data, isn't showing up on the web, then it's understandable that it appears missing within those models.
And missing information can contribute to the widely reported generative AI hallucinations, because all the AI is doing with the data it has is predicting the next word. We're seeing from research that EBSCO has conducted, as well as other preprints, that if we integrate large language models with a knowledge graph, we can increase the accuracy of the results by up to 56%, and we believe that can be developed further by using data graphs and library data to get better results. And that's the direction I'm going to be taking today. I'll speak a bit more about what we've done to date, but in a nutshell we've taken MARC data and we've converted it into linked data. Each library gets its own data graph, and this is connected across a network of data. We can also work with the library to make their data better by connecting their data graph, the linked data that goes into the data graph, to other authoritative data sources. I'll cover some of these in a little while. We do this in a few ways. Firstly, we construct URIs from any strings in MARC records when we've got the authority record connected, or we can connect it to other libraries that use authoritative sources. We call this process enrichment; it's really joining graph to graph. We find this approach is able to bring users into the library from tools that they expect to use, things like Google, where your data graph is picked up by Google, which takes around two business days, give or take. And that allows you to do things like search for a book in Google and find the closest library to you, or have one that's pinned. So that's the backdrop to what I'm going to be talking about. But really, why aren't catalogs visible today on the web?
And when I was developing this slide, I was thinking of a conference presentation in Charleston, I think in 2007 or 2008, called "MARC is Dead, Long Live MARC", delivered by a lady called Jane Burke. I think at the time I didn't understand it. We've all had quite complex relationships with MARC, and of course it's pervasive. There has been the desire to change from MARC as the pretty-universal bibliographic record onto something that is less locked in, that can be utilized in web-connected environments. And this is what I mean by the point where it says data is not connected. Within the silos that we have in our libraries, the data we hold is fantastically authoritative. If we were able to make that transformation into a model that can utilize the authority of the data in the catalogs, as well as joining other sources through enrichment, then we're going to be in a position to work with our various communities to find data in different ways, in different forms, and go to the places our different communities expect us to be. I almost didn't show this slide, but I'm probably quite glad that I did. I'm not going to spend too much time on it; it's just a short glossary of what I'm talking about here in terms of linked data, knowledge graphs and large language models. Linked data is structured data that is represented by relationships between data entities. Once we have many of these together, and those corresponding relationships within the linked data are understood in a broader context, that forms a knowledge graph. And large language models use data that is web-enabled in order to train those generative AI models. So I can't dwell on this, but it's probably worth putting in.
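The glossary above can be sketched in a few lines of plain Python. This is illustrative only: the prefixes and URIs are invented, not real library identifiers.

```python
# Linked data: statements as (subject, predicate, object) triples,
# where each entity is a URI rather than a plain string.
triples = [
    ("lib:work/brownian-movement", "bf:creator", "lib:person/albert-einstein"),
    ("lib:person/albert-einstein", "rdfs:label", "Einstein, Albert, 1879-1955"),
    ("lib:work/brownian-movement", "bf:subject", "lib:topic/brownian-motion"),
]

# A knowledge graph emerges once many such triples are combined and the
# relationships can be traversed in a broader context.
def neighbours(graph, entity):
    """Return every (predicate, object) linked from an entity."""
    return [(p, o) for s, p, o in graph if s == entity]

for predicate, obj in neighbours(triples, "lib:work/brownian-movement"):
    print(predicate, "->", obj)
```

Real systems use an RDF store rather than Python lists, but the shape of the data, entities linked by named relationships, is the same.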
And why is this linked data, this move from MARC into a different format, important, considering MARC has been around for so long? Why should it be given any airtime when certainly all of you, and people on my side of the library equation as well, are very, very busy? Well, it's because we need to increase that engagement. Going back to the OhioLINK infographic that I showed right at the very start, we need to find ways of increasing that engagement. And in terms of the collections, we've got those physical collections; we need to make sure that we are able to rehash, reuse and repurpose the data we've got in a way that is useful to our different communities. There's nothing new about wanting to do this, nothing new within library land, and certainly nothing new within the broader context of the information industry and other industries. In other industries, and I'm thinking here of examples in retail and agriculture and the like, the problem of siloed data was noted a long time ago. And I think those industries have indeed been successful, to a degree, in de-siloing this data by applying a common vision. You've probably heard of Schema.org, the collaboration between the big players of Google, Microsoft and Yahoo in 2011 and 2012. Well, EBSCO has been a big supporter of BIBFRAME for quite a long time, about seven or eight years. We have the first production-ready service that's completely focused on linked data within libraries, and the work that we're doing with the Library of Congress, with their decision to move to FOLIO, has allowed that 2012 statement to begin to be really addressed. And that's what I'm gonna talk about now: packaging up this data. So, the case study of the Library of Congress. Around 18 months ago, the Library of Congress selected EBSCO to replace their incumbent library system with FOLIO.
In the next few slides, we'll look at what we're doing within this project, and at how the work done with the Library of Congress can be utilized in other libraries. I think it's worth pointing out that the Library of Congress doesn't want a bespoke Library of Congress system; it wants a means of elevating where we go with bibliographic description. And that's really what I'll talk about in the next few slides. So what is it that we're building with the Library of Congress? We're building a new architecture, actually, which we call the data graph. We've used this approach in other parts of EBSCO, but we've taken the opportunity to take that technology, and all the lessons we've learned from it, and make large portions of it open source, so others can use it to move beyond MARC if that's what they choose to do. And that word "choose" is really, really important. There's no mandating the move from MARC to BIBFRAME, and that's something which I'd really like to resonate with you. But in the move to the data graph, we're able to look towards a more machine-readable and web-friendly format. To get to that position, we need the ability to transform the data from where it is now to where it's going. We call this a transformation pipeline; some of you who have been through these kinds of processes will know it as an ETL pipeline. The process takes MARC and, using a set of rules that has really been a collaboration of many, many libraries, defines the links within the data to form the data graph. For every person, place and thing, we create a link and a relationship between them, so that we take the original MARC data and transform it into BIBFRAME to create the data graph. And as FOLIO is format agnostic (it has been built, at the wishes of the community, to be absolutely agnostic to the format), the data graph forms the architecture of the collection within FOLIO.
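A much simplified sketch of that kind of transformation pipeline: MARC-style fields in, linked data triples out. The field tags (100, 245, 650) are real MARC tags, but the rule set and the URI-minting scheme here are invented for illustration and are far simpler than the community-developed conversion rules.

```python
import hashlib

def mint_uri(kind, label):
    """Derive a stable, unique identifier for a person, place or thing.
    (Illustrative: real pipelines resolve against authority records.)"""
    digest = hashlib.sha1(label.encode("utf-8")).hexdigest()[:8]
    return f"lib:{kind}/{digest}"

# One "rule" per MARC tag: which entity kind it names, and which
# predicate links that entity to the work.
RULES = {
    "100": ("person", "bf:creator"),
    "650": ("topic", "bf:subject"),
}

def transform(marc_record):
    """The extract-transform step: emit triples for one MARC record."""
    work_uri = mint_uri("work", marc_record["245"])
    triples = [(work_uri, "bf:title", marc_record["245"])]
    for tag, value in marc_record.items():
        if tag in RULES:
            kind, predicate = RULES[tag]
            triples.append((work_uri, predicate, mint_uri(kind, value)))
    return triples

record = {
    "100": "Einstein, Albert",
    "245": "Investigations on the Theory of the Brownian Movement",
    "650": "Brownian movements",
}
for triple in transform(record):
    print(triple)
```

The "load" half of the ETL would then write these triples into the data graph's storage, which the next section touches on.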
And I don't necessarily say Library of Congress FOLIO, because this is an open source project which allows all libraries to capitalize upon the work done with the largest libraries in the world, the likes of the Library of Congress, and cascade that to some of the smallest libraries in the world as well. So once we've got this ETL, this transformation process, we need somewhere to store those data, and the data graph does include storage. This is a new type of schema inside FOLIO that allows tracking of linked data resources. There are around 90 different classes that can be seen in the inventory, and this is where we hold the data we've transformed into BIBFRAME. But what about MARC? We're really not gonna simply move on, simply because there's no reason we would want to jettison MARC for BIBFRAME, certainly in a transition period, which may be many, many months, if not years, if not decades. But there are other reasons why we wouldn't do that as well. One reason we don't wanna see the death of MARC is the long relationship that we've had with it. Although the relationship has certainly been strained at points, there are tools within the Library of Congress, and certainly when we look at the Cataloging Distribution Service, it's something that we really want to preserve, because it's a tool that is in use within many parts of the information industry, and FOLIO will certainly respect that. In terms of the sharing and the portability of data, linked data is really an intrinsic part of what we're doing with the Library of Congress FOLIO system. But sharing of data is really paramount here as well, so the data can be taken and used across different systems: for example, if we had open source discovery services, in fact any discovery service.
And the challenge of MARC was not that the data is locked within your own graph, but that access to those endpoints is in a place which you may not be able to get to. With the service that's being developed, that's no longer an issue, because of the linked nature of the data. We'll also see that trusted cataloging sources such as the Library of Congress, and the ability to feed data into other FOLIO libraries, are going to become increasingly important. And for that, we need a mechanism whereby we can edit and describe data within the FOLIO environment. We call that Marva. Marva is an application that was initially built by the Library of Congress for the description of resources, and it contains many profiles: for example, manuscripts and rare books and loads of other things. What we're doing within the project at the moment is elaborating on this and integrating the Marva editor into FOLIO. But we're also ensuring that the Marva editor can be used externally to FOLIO, so it can be used by other applications. That would allow libraries that don't have FOLIO to utilize a BIBFRAME editor in their workflows. And quite frequently I go back to this quote, and you'll have seen it, I'm sure, in many places. It's from Linus Torvalds, the gentleman who developed the Linux kernel. With linked data features, how I see it is that it allows you to take control of your own destiny, much as Linus expresses here with real open source and the control of destiny that you can take in that context. It allows us to look at creative ways of engagement with systems and services and workflows that we have all over the information industry. So that adage of being able to rehash, repurpose and display content in places that maybe we've not been able to in the past is gonna be increasingly important. But in doing so, of course, we need cooperation.
And cooperation within the context of the EBSCO, FOLIO and Library of Congress project is ongoing, but of course there are bigger projects there as well. What we're looking at doing here is elaborating on the cooperation of the global FOLIO project: looking at new tools, new ways to explore collections, new system architectures. And that system architecture is also important. This is something that we're doing with the Library of Congress, but also with others around the world as well. It's looking at engagement and cooperation from the start to the finish of the process, and it points to the objectives and vision of FOLIO supporting entity-based data models. From the start of the FOLIO project, the community has always been, and always will be, really at the heart of FOLIO; it's how functionality is driven forward. So EBSCO works with this community, but we also work with other communities as well. We work as a consultant to the BIBFRAME Interoperability Group, because what we want to do to support our friends and colleagues within the library world is to develop best practice and a cohesive approach across the industry. And this isn't nebulous stuff; this stuff is coming really quickly. The first iteration of the Library of Congress FOLIO deployment happens in September or October this year. But before that, I think next month, in April, the Marva editor is gonna be available within Library of Congress FOLIO. I believe that's correct. So what can you do today? Because the Library of Congress is a big project moving along at a pace, but the problem exists right now for libraries that want to engage with various communities in various different ways. And the problem statement is really the same as I mentioned previously: the data that libraries have got, have curated for a very long time.
It remains locked in silos, but it also remains really authoritative and very, very good data. And that means that the representation of those data within web environments, and the relationships between the different data entities, are not established. In turn, this means that users can't engage with those resources, because not only is the data not where the user expects it to be, but the relationships between the data that forge new insights into the library collection, those relationships between a subject and an object, are missing too. And that's not just problematic when we look to use the catalog as a resource; it's problematic when we understand that this highly curated and descriptive metadata can't be used in the context of large language models, which are trying to predict that next word: what is it that the person is going to want next? And hence why generative AI hallucinations continue to make the headlines. If we're able to use large language models with more authoritative data, then we've got a better chance of hitting that next word. Conversely, of course, if, as now, the data is less complete than we would have wanted, then the large language models make their predictions on less than perfect information. And without critical analysis skills, the information that's provided to your library users from these generative AI tools may be considered accurate by them. Hence the disclaimers, telling us to check the generative AI responses, on the front page of all of their tools. So the problem is definitely there, and we have ways that we can look to address it. The conversion and syndication of high-quality, credible data has arguably never been more important against the backdrop of large language models and generative AI.
And it really is, I would contend, the responsibility of both the library and organizations such as EBSCO that supply services to the library to work together to engage with our users in different ways, and to make sure that when generative AI is used, the data those generative AIs are trained on is as authoritative as we can make it. Here are just six ways that we can look at bridging that gap. They're not the only ways; there are plenty of others. But these are six of them. Initial transformation, I would suggest, is foundational. Once we've got that data as meaningfully interlinked data, we syndicate it to a data graph, and we can pick out aspects of the collection within living lists, perpetual lists, which are tailored to the needs of our different communities and delivered in the web environments where they need to be. Then there's enriching resources from other data sources, whether that's, in the case of the Library of Congress at least, LCNAF, or VIAF, or ORCID, or wherever else. And then, if we are taking the time and effort to take this highly curated data and transform it, we need to be able to put it where people are going to be on the web, driving people back from where they expect to find information to the library catalog, and driving usage that way. And, as I mentioned earlier, making sure that books can show up within a Google search, and people can borrow a book from within Google and get directed to an endpoint which makes sense, is an important thing. It's also important for libraries to be able to say: yes, I am there in Google; I do have my data syndicated towards the Google knowledge graph. And when people do search in Google, because it's simple, fast and easy, we have the ability not just to buy a book but also to borrow that book, maybe from a local library.
And then, increasingly, to join a network of libraries, a connected network of graphs from libraries and trusted metadata sources, so that those important resources that you all curate so carefully are available in different ways to different communities. So, looking at an example of how we take these data in MARC and convert them: this is TU Dortmund, who utilize this service. We'll take a well-trodden example, Albert Einstein. We take the data in MARC format, and we have the notion here of creator; the creator is Albert Einstein. We know that this has a unique position within the data graph, and that's the F4…KZN…F0S identifier you can see in blue at the end. This positions this creator within the context of the data graph. Once we have that position and that unique identifier, we have it not just for the subject, in this case the work, the Brownian movement work, but for the object, which is the creator, the person that created this, Albert Einstein. And we've also got a unique identifier for the relationship, the predicate. We call these BIBFRAME resources. We amplify these further when we look at how the data is represented within the graph. Here we've simply put the data into GraphDB, and from this we can see that we have the gentleman in context here, Albert Einstein, and we have the work that was created by him. And, if you look between those two blue circles, we have the creator relationship. So we know that within this data graph we have the subject and the object and the predicate. And of course the data graph expands based on the number of identifying predicates and the number of different entities that we have here as well. When we start moving this into connected graphs, it starts getting, I think, quite interesting. So you as a library right now can convert your data from MARC into BIBFRAME, and you'd probably get quite a lot out of that.
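The point about subject, predicate and object all carrying their own identifiers can be sketched as follows. The identifier suffixes here are made up; the real graph mints its own.

```python
# All three parts of the statement are identified resources, so the
# relationship itself ("creator") is a first-class node in the graph.
work    = "bg:work/brownian-movement#a1b2c3"
person  = "bg:person/albert-einstein#d4e5f6"
creator = "bg:relation/creator#090807"

graph = {(work, creator, person)}

# The graph expands as more identified entities and predicates arrive.
graph.add(("bg:work/relativity#aa11bb", creator, person))
graph.add((person, "bg:relation/label#0102", "Einstein, Albert, 1879-1955"))

# Everything connected to the person entity, in either direction:
connected = [t for t in graph if person in (t[0], t[2])]
print(len(connected))  # the person now sits at the centre of 3 statements
```

Because the predicate is itself identified, two libraries that minted the same creator relationship can later recognise they are making the same kind of statement, which is what makes joining graphs possible.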
That data is then syndicated to the Library.Link Network. This is the data graph, and it's open, so library A and library B can see which resources they have in common, as well as which are independent. This may open up an interesting world in terms of who has the last copy, of sharing of resources, of looking at whose collection is authoritative in a particular way. There's an example where the University of Melbourne was looking at some work that had been done by the University of East Anglia, who don't subscribe to this, by the way, but we have tested their data within the data graph. Because of some work that was ongoing around decolonization, the University of Melbourne was able to reference that work within the data graph and look at their collection versus what UEA had done, which I think was actually quite a valuable exercise. In terms of what we do to enhance these resources: it's great getting the data from libraries, but there are also lots and lots of other authoritative data sources out there. This is by no means an exhaustive list, as I'm sure you're aware, but we can take data from these sources and automate the inclusion of those data within your data, within your data graph. And this means that if we're looking at engaging with our users in different ways, we can add in that ORCID data, the Crossref data; we've been working quite a lot with Wikidata. Whatever you would consider to be authoritative; we do have quite a long list of other sources that we're gonna be pulling into the system. If we go to Albert Einstein in Wikidata, we can see that there are certainly non-English-language inclusions of names here, and that might be very important when it comes to engaging the different communities that we have. We see the same thing with the Library of Congress.
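A sketch of that enrichment step: joining an external authoritative source onto a local entity via a shared identifier. The local URI and the mapping are illustrative; Q937 is Albert Einstein's real Wikidata identifier, and the label data below mimics Wikidata's multilingual label shape.

```python
local_graph = {
    "lib:person/albert-einstein": {
        "label": "Einstein, Albert, 1879-1955",
        "sameAs": "wd:Q937",  # the mapping that lets us join graph to graph
    }
}

# What a fetched slice of the external source might look like.
external_labels = {
    "wd:Q937": {
        "en": "Albert Einstein",
        "ja": "アルベルト・アインシュタイン",
        "ar": "ألبرت أينشتاين",
        "ru": "Альберт Эйнштейн",
    },
}

def enrich(graph, external):
    """Pull multilingual labels across the sameAs link into the local graph."""
    for entity, props in graph.items():
        key = props.get("sameAs")
        if key in external:
            props["labels"] = dict(external[key])
    return graph

enriched = enrich(local_graph, external_labels)
print(sorted(enriched["lib:person/albert-einstein"]["labels"]))
```

Those non-English labels are exactly what lets a catalogue surface the same person to communities searching in different scripts.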
And so what we do in terms of BiblioGraph is take those mappings and pull in the data as required, so that we're effectively joining those graphs together. Once we've joined those graphs together, once we've mapped them, we're able to portray the data in lots of really flexible ways. And the reason it's so flexible, the reason we associate multiple graphs together, is that the goal of engaging the user where they are, where they expect to find data, is very important. So this is a representation of items by Albert Einstein and also items about Albert Einstein, and we're able to very clearly differentiate the two. If you were to click on one of these, you would be directed to the catalogue, or Horizon maybe, or a discovery service. It also means that we can show off our collections, and we should be super proud of doing exactly that. The collections that I've been fortunate to view in many libraries have been absolutely wonderful, and they would be fantastic additions to people's studies and general interest if we can go to where people expect us to be. So we can create these knowledge panels that change as our collection changes. We can define them really accurately; we can put them into faculty pages and LibGuides, and stick them on blogs, and put them on socials, because that is where the collection is potentially gonna be viewed by our usership. And this is used in quite a lot of different places. This is how the European Parliament define some of their collections; this one is about European citizenship, and they can go in and engage people within their LibGuides; they make quite heavy use of this service within their LibGuides.
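The "by" versus "about" distinction above falls straight out of the triples: the same person entity appears as the object of a creator predicate (items by) and of a subject predicate (items about). The identifiers and titles here are illustrative.

```python
person = "bg:person/albert-einstein"

graph = [
    ("bg:work/brownian-movement",  "bf:creator", person),  # by Einstein
    ("bg:work/relativity",         "bf:creator", person),  # by Einstein
    ("bg:work/einstein-biography", "bf:subject", person),  # about Einstein
]

def items_by(g, who):
    """Works where the person is the creator."""
    return [s for s, p, o in g if p == "bf:creator" and o == who]

def items_about(g, who):
    """Works where the person is the subject."""
    return [s for s, p, o in g if p == "bf:subject" and o == who]

print("by:", items_by(graph, person))
print("about:", items_about(graph, person))
```

This is why a knowledge panel can cleanly separate the two lists: the distinction lives in the predicate, not in string matching against titles.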
But then we can also look at how we can appear in the most generalist searches. If you are searching for a book and you want to find it, then traditionally you would have the ability to buy that book from one of the regular players. But increasingly, the work that we've done with Google over the past six or seven years has ensured that in some geographies, and those geographies are expanding, the ability to borrow a book from within Google is a reality: in the US, Canada, Australia and the UK. In this case, if I search for Fundamentals of Structural Geology, then it would geolocate me, or I can pin a library, which is good in terms of the workflows that we have. So you find your book, you have the borrow action associated, and then you get delivered to where you need to be. We're not the only folks to offer this, and I think that's also part of the community, but we do have frequent meetings with Google to look at how we can use the data we've got to fulfil the expectations of library users. Getting towards the end, thank you so much for sticking with me. There's an awful lot going on in terms of linked data at EBSCO. I've used the case study of the Library of Congress simply because it's a fairly major library, I think we'd probably agree. But there are other things too. For many years, actually, we've had a knowledge graph associated with EBSCO Discovery Service, and some of that work has been made open source. We can also look at ways to share other work that we've done. For example, with the Discovery Service work, it would be nice if we could bring this proprietary graph technology of subjects and concepts, which allows data visualization in EDS, to other applications. It's something that we're looking towards maybe doing.
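Surfacing a catalogue record in web search relies on structured markup the search engine can read. Below is a sketch of Schema.org-style JSON-LD for a book with a borrow action; `Book`, `BorrowAction` and `lender` are real Schema.org types and properties, but the names, URLs and the exact properties any given search engine consumes are assumptions here, not a guaranteed recipe.

```python
import json

book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Fundamentals of Structural Geology",
    "author": {"@type": "Person", "name": "Example Author"},  # placeholder
    "workExample": [{
        "@type": "Book",
        "potentialAction": {
            "@type": "BorrowAction",
            "lender": {"@type": "Library", "name": "Example University Library"},
            # Pointing back at the library's own endpoint is what drives
            # users from the open web into the catalogue.
            "target": "https://catalog.example.edu/record/12345",
        },
    }],
}

print(json.dumps(book, indent=2))
```

Embedded in a web page (or syndicated through a data graph), markup of this shape is what lets a search result offer a "borrow" path alongside the usual "buy" options.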
And then, as we continue to work on BiblioGraph and learn more and develop the service into other arenas, we look at what's needed in developing different models. At the moment, BiblioGraph works with around 20 different library systems and five or ten different discovery services. It's very closely aligned to FOLIO, but although it's aligned, it's kept external so that it can retain its ability to work with those many different systems; it wants to be able to fulfil its promise rather than being tied into one system or service. It also means there's no additional pressure on FOLIO. So BiblioGraph can be leveraged as one of a number of different tools connected to FOLIO, rather than being part of FOLIO, while maintaining its contact with the FOLIO community. The data structures that I've highlighted in the work we're doing with the Library of Congress are very important, and they're gonna be part of one of the next releases of FOLIO, so that the smallest library in the world that has FOLIO can have the cutting-edge bibliographic tools that I've described. The directions of FOLIO and BiblioGraph are certainly aligned. But we're also thinking about how we move beyond bibliographic. We certainly started in this world, and the challenges, as I've mentioned, are that MARC keeps data locked inside catalogs, the data are not connected, and getting access to the data between different systems, that flow of data between systems, is really hard. But increasingly we're looking at different parts of the information space and realizing that data portability is important, not just in the library system to avoid things like vendor lock-in, but to really facilitate data exchange between different organizations and enable users to retain control over their own data. And that's something we're looking towards as "beyond bibliographic".
And to be on, I think, the cutting edge of some of these services is really exciting, and the more we work and learn in this area, the more possibilities we uncover. This is where we begin to get into the practical application of what "beyond bibliographic" data means. Of course, that also means further extensions to linked data models, and those different extensions are gonna require support. As an example, we recently worked with the ELNET Consortium in Estonia, which is headed up by the National Library of Estonia, on a set of really complex and disparate repositories containing many, many types of data: audio recordings, video recordings, PDFs, images, all sorts of different things. What we did, working in partnership with Knowledge Integration, who you may know, is that we normalised these data into a JSON format that was then enriched by BiblioGraph data. So the data we took, from I think 19 repositories, was normalised into a unified format. And that normalisation process allowed us to have not 19 different administration interfaces and data flows, but just one, which harmonised the process and allowed us to look at the data throughput of all the different sources. We took this data in JSON format and created its own data graph, which allows us to also use BiblioGraph data to enrich that repository data. Thinking about what we want to do here: we want to take the metadata, which may be in some cases complete and in some cases less so, and wherever possible apply as much as we can in order to facilitate the discovery of this material in the context that is appropriate to it. We've developed a really nice, modern user interface for this that potentially could be repurposed into other projects.
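The normalisation step described above can be sketched like this: records from differently shaped repositories mapped into one unified JSON format, with provenance kept. The field names on both sides, and the sample records, are invented for illustration; the real project handled 19 sources and far richer schemas.

```python
import json

# Two repositories describing similar assets with different schemas.
repo_a = [{"titel": "Vanalinna helisalvestis", "typ": "audio", "aasta": "1968"}]
repo_b = [{"name": "Festival footage", "kind": "video", "dateCreated": 1992}]

# One mapping per source schema, onto the shared target fields.
MAPPINGS = {
    "repo_a": {"titel": "title", "typ": "format", "aasta": "year"},
    "repo_b": {"name": "title", "kind": "format", "dateCreated": "year"},
}

def normalise(source_name, records):
    """Map one source's records into the unified format."""
    mapping = MAPPINGS[source_name]
    out = []
    for rec in records:
        unified = {target: rec[src] for src, target in mapping.items()}
        unified["year"] = int(unified["year"])  # harmonise types too
        unified["source"] = source_name         # keep provenance
        out.append(unified)
    return out

unified = normalise("repo_a", repo_a) + normalise("repo_b", repo_b)
print(json.dumps(unified, ensure_ascii=False, indent=2))
```

Once everything shares one shape, a single administration interface and a single enrichment flow can serve all the sources, which is the point made above about going from 19 data flows down to one.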
And this is important because we've now got these data, which can go off to Europeana and be put into lots of other contexts. But importantly, we also show the relationships between these cultural assets and the new pathways by which people can get to this authoritative information, which they weren't able to do beforehand. I wanted to shoot for around half an hour, and I've gone over by ten minutes, so my apologies for that. But if you've got any questions, then do please ask them now, or find me on email; I'll be very happy to try and help. I'm going to go to the Q&A and see if there's anything we have here. So, we've got an anonymous attendee: "Is this something that could be trialled with a catalogue such as the Jisc Library Hub?" I see no problem with that at all. This is a journey that we're going on with lots of different libraries, so please do email me and we'll unpack that a little more. William, hey, William: "Richard, can you say something about the rights associated with bibliographic records, and whether the current regime works to support the innovations you've been describing or hinders them?" I don't know, William, if I'm qualified to answer that. My fairly basic perspective, and I'm very likely wrong, is that the library catalogue data you have within your system is your data. We don't want to convert data from a particular vendor into records that would cause any legal strife. We want to elaborate on the connections between those different data entities to allow the discovery of that material, rather than cause problems associated with the rights that you or others may hold.