Hi, good morning. I'm Herbert Van de Sompel, Los Alamos National Laboratory, and I'm here with Miel Vander Sande from Ghent University, my alma mater, actually. We've been working on a project to create a Linked Data archive, more specifically an archive of DBpedia versions, and it uses a couple of interesting, well, recent technologies to make it all happen. So we're going to brief you about the project results. It could be interesting, I think. I'm going to start off with a little reminder of Memento and how Memento applies to Linked Data. Then I'll talk a bit about the first-generation archive that we had already built in 2010 and why that didn't really work out anymore, which leads us to the notion of devising affordable and useful Linked Data archives: what is affordable for the publisher and yet still of interest to a consumer of an archive? It's about finding that balance. Then Miel will introduce two interesting technologies that fit into this discussion, triple pattern fragments and binary representations of RDF, and then we'll talk about the second-generation archive that we built. Miel will then actually give a demo, and he insists on doing it live. I advised him not to, but here we are. And then we'll give a couple more pointers. The slides are available on SlideShare; it's actually a more extensive version of this slide deck that is there, with more pointers and what have you.

So we start with Memento and Linked Data. Memento is basically time travel for the web. It is a straightforward extension of the HTTP protocol that introduces the notion of datetime negotiation. So basically it is about a client that gets in touch with a resource and says: I am not interested in what you look like today, I'm interested in what you looked like at some time in the past. In order to make this happen, the original resource, the one the client gets in touch with, points at another resource, which we call a TimeGate. The TimeGate is a resource that knows about the past of the original resource; it has a notion of its version history. Once a client is at the TimeGate, it can do content negotiation, but in the time dimension, and the TimeGate will redirect the client to an appropriate prior version of the original resource. What I just explained is a bridge from the present to the past, but Memento also has the inverse, a bridge from the past to the present. So this really allows one to go back and forth between the present and the past of the web.

Memento was introduced somewhere in 2009, and already in 2010 we understood that it was of real interest to apply Memento principles to Linked Data. So we had a paper at the Linked Data on the Web workshop in which we showed how this datetime negotiation that I just explained could be implemented for Linked Data. What you see on top there is your typical Linked Data environment, with at the left-hand side what they call the non-information resource. That is a resource with a URI that represents an abstract concept, for example the city of Paris. And typically, in Linked Data implementations, from that resource you get redirected to the resource that describes the concept. That is called the information resource, and that's where you find the representation.
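To make the mechanics concrete, the exchange looks roughly like this at the HTTP level; the URIs and dates are made up for illustration, and actual archives may vary in details such as the redirect status code:

    # The original resource advertises its TimeGate in a Link header
    curl -I http://example.org/resource
    #   ...
    #   Link: <http://archive.example.org/timegate/http://example.org/resource>; rel="timegate"

    # The client then negotiates in the time dimension with the TimeGate
    curl -I -H "Accept-Datetime: Thu, 01 Apr 2010 00:00:00 GMT" \
         http://archive.example.org/timegate/http://example.org/resource
    #   HTTP/1.1 302 Found
    #   Location: http://archive.example.org/20100401000000/http://example.org/resource

    # The memento it redirects to reports its own archival date in a
    # Memento-Datetime header and links back to the original resource.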
So since Memento is all about representations, one actually attaches the TimeGate link to the information resource, and that is the link the client follows to get to the TimeGate. Once the client is there at the TimeGate, it can negotiate for a past version of a Linked Data description. That's basically the setup. It's the same as with the regular web; the only difference is that one applies it to the information resource, not to the non-information resource.

In that same paper, we showed the power of this capability. What you see here is a graph that was created automatically, purely by doing datetime negotiation with these resources. The resources that were used were those representing different countries in DBpedia, and each was negotiated for different moments in time. Out of the descriptions that came back, one property was collected, namely the gross domestic product per capita for that country. So just by doing this for different versions of DBpedia descriptions, you could make this plot. We thought that was pretty powerful, and that's also why we decided to start promoting these concepts within the Linked Data community, so beyond the web archiving community. And that's why we built a DBpedia archive. I'll outline a bit how we went about that.

The DBpedia dumps are available for download, so you basically download them and load them into an archive to make them usable. What we did in the first generation is use MongoDB to stuff all of that in there. So you have one blob in MongoDB per topic URI, so per subject URI, like the city of Paris again, and per timestamp, and it all goes into this one archive. The load software for that was custom. When we came to the more recent versions of DBpedia, it took us about 24 hours to load a version, because the whole index structure had to be rebuilt in the setup we had there. It took about 400 gigabytes to store 10 versions, and we only loaded 10 versions because at a certain point this architecture simply wouldn't scale anymore; we just couldn't add further information.

That was the storage perspective. From the access perspective, the current version of DBpedia provided a TimeGate link to our archive, and a client was then able to negotiate in time on the subject URIs of DBpedia topics: the city of Paris, the United States, et cetera. So only subject URI access. And these are the kind of pages, obviously also available in other RDF serializations, that would come back from the archive. They look very much like what DBpedia itself serves, only they are prior versions. The TimeGate software we used, which provides access by subject URI and datetime, was also custom built, with access only on subject URI. And there was an integration with the current version of DBpedia through that TimeGate link; the people at DBpedia actually provided that link to our archive. So, as I mentioned, only subject URI access. And because of the way we stored this in MongoDB, it was not a scalable solution: we were not able to load version 3.9 in 2013 anymore. So the archive has basically been frozen since then. We haven't loaded the 2014 and 2015 versions in this architecture.
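To give a flavour of what that subject URI plus datetime access amounts to for a client, the harvesting behind the GDP chart boils down to repeating the negotiation for a series of dates and picking one property out of each returned description. A rough sketch in shell, where the TimeGate URI and the property match are placeholders rather than the real ones:

    # The TimeGate URI would come from the Link: ...; rel="timegate" header
    # on the DBpedia resource; this one is just a placeholder.
    TIMEGATE="http://archive.example.org/timegate/http://dbpedia.org/resource/France"

    for year in 2008 2009 2010 2011; do
      # Build an HTTP date for 1 January of each year (GNU date)
      dt=$(date -u -d "$year-01-01" +"%a, %d %b %Y 00:00:00 GMT")
      # Negotiate for the description as it was around that date, follow the
      # redirect to the memento, and pull out the GDP-per-capita statement.
      curl -sL -H "Accept-Datetime: $dt" -H "Accept: text/turtle" "$TIMEGATE" \
        | grep -i "gdp"
    done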
So because we're really convinced that this is a cool thing to have for the Linked Data community, we set out to reflect on how we should do this in a way that is actually sustainable for ourselves, publishing this archive, and still useful for the client. That leads us to the notion of affordable and useful Linked Data archives. Where is the balance between what is affordable for the publisher, the archive, and what is still useful for the consumer? I'm going to do a non-scientific evaluation of a couple of criteria. The people in Ghent have actually written papers about the things I will be talking about, but we can't go into that kind of detail here. I'll just go over a couple of characteristics of typical solutions. For example, availability of a solution: is it a highly available solution, or is it complicated to run and hence less available? What is the bandwidth required to access the solution or to provide it? What is the cost for the publisher and for the consumer? And then a couple of things that relate to the functionality for the client specifically. What is the expressiveness of the interface? The previous archive we had offered only subject URI access, but you could think of an archive that supports SPARQL queries; that's what I mean by expressiveness. Is it integrated with the Linked Open Data cloud in general, meaning: can I follow my nose? Can I implement Memento support for it? And something that I think is very important, which goes back to the chart I showed you: this notion of being able to walk across time and across data sets. Can I actually implement that with this kind of setup?

The first one I'll do is subject URI access, because that's the one we implemented. Availability is rather high, because you can use pretty simple technologies to implement subject URI access. We used MongoDB, probably not the smartest choice; we could have used ARC files or WARC files, there are simpler technologies we could have used. So availability can be rather high, both for the publisher and for the consumer, obviously. The bandwidth required is proportional in this case to the size of a description, and these things are typically not big, your typical Linked Data description. The cost for a publisher in this kind of solution can be relatively low, again because it's a simple solution. But at the consumer end, and this is where you start seeing the balance, it's high, because in order to really resolve a query, a client has to collect an awful lot of Linked Data descriptions, and it may actually not even be able to collect them all. So it really is more favorable for the publisher here than for the consumer. Interface expressiveness, as I just said, is rather low. LOD cloud integration is obviously there, because one moves from one URI to another just by following your nose. Memento support is possible, because we did it. And the same for cross time and data: you can do it, because you're just following your nose from one data set to another or from one timeframe to another, so that's positive. The verdict on this: clearly not too expensive from a publishing perspective, but not very high functionality at all for the client.

If you step over to another typical way of publishing Linked Data, which is by means of dumps: as I mentioned, DBpedia is also available as a data dump. From the publisher's perspective this is of course extremely cheap, with extremely high availability.
You just put the file out there, and people have to come and download it. From the perspective of the consumer, also high availability. But now the cost for the client is really high, because it needs to get the entire data set, which, if you're going to work with DBpedia for example, is very significant. You have to download it and store it in a queryable environment, so you're looking at significant cost. With a dump you obviously have no Linked Data support; it's just a dump. You cannot implement Memento. And when you want to go cross time and cross data, you basically need to download all these different data sets in order to traverse them in your local environment. So this is not very favorable: very cheap, of course, from the perspective of the publisher, but not very useful for a client in my opinion. For an archive it's maybe an acceptable approach, but still, maybe we can do better than this.

Then of course SPARQL is another way Linked Data is accessed very frequently. People have done a lot of work looking into the availability of public SPARQL endpoints, and it's really problematic. Just think about it: no one in the past would have thought of putting their SQL database out there for anyone to query, and that's basically what we're talking about. You put the SPARQL endpoint out there, and now anybody can come in and ask whatever query they like. That's going to be problematic, and it results in low availability of these public SPARQL endpoints. The bandwidth is proportional to the query: some queries result in little data, others in an awful lot. The cost at the level of the publisher, and again this relates to the technology you're going to use, is really high, because you have to keep that SPARQL endpoint operational. At the consumer end it's low, because you can pose a really expressive query and get back exactly what you want, so it's a good deal for the client. The interface is extremely expressive. There's no Linked Data integration, or only to an extent, for SPARQL queries that can be expressed as HTTP GET URIs. Memento support would be really hard; again, for certain queries you could do it. And this cross time and data thing is possible, but it would typically involve distributed SPARQL queries, which is a research topic in its own right. I keep reading and reviewing papers about that; I think it's a messy kind of world. So from a publication perspective, extremely expensive; from an access perspective, interesting, because you can formulate really expressive queries. But then again, a couple of other things are not readily available, like Memento support, and LOD cloud integration is not straightforward, and so on.

So this is an overview of this admittedly non-scientific evaluation, comparing these three ways of making Linked Data archives available: data dump, SPARQL endpoint, subject URI access. You see that I left a row open; that is of course for the solution that's going to be really great. And I'm going to hand over now to Miel, because he's going to introduce two technologies that are really interesting from an archival perspective and for access to Linked Data. All right. Hi. So I will just go straight into it.
So basically what Herbert said is exactly what caused us to do this kind of research. We noticed that there was a lack of queryable Linked Data on the web, usually because people don't want to host a public SPARQL endpoint, and if they did, it had availability problems. So most people just stuck to putting RDF files online as data dumps, which is also not really Linked Data and not really what we wanted. We were wondering: why does nobody explore the axis between these two extremes? I mean, there are REST APIs; we have been working with all sorts of web data access in the past. So why can't we apply those ideas properly to Linked Data? That's where we came up with the name for this axis: Linked Data Fragments. All possible interfaces to Linked Data return you some kind of fragment in one way or another. You could say that a SPARQL endpoint returns you query results and a data dump just gives you all the data, but they're all fragments in some way. So we set up this conceptual framework that we could do research in, and try to find new trade-offs between the interface and what the client has to do.

We have three main characteristics that define a fragment. The first is the selector: what kind of questions can I ask my interface? Is it a SPARQL query? Is it a file name? Is it a subject URI? Second, what are the controls? When I query this interface, how can I find more fragments, how can I find more related data? That is very much like a REST API, which uses hypermedia to navigate you to other resources. And then metadata: what is helpful for the client? What kind of information can I supply with my fragments to help the client fulfill its task? The important thing is that every Linked Data Fragments interface comes with certain trade-offs. You can't have it all; there's no magic bullet here. It's a matter of finding the tipping point: where can I restrict my server so that it's easy for me to put up Linked Data without running into availability problems and so on, while on the other hand clients can still do the tasks they are used to?

As a first instance of such a Linked Data Fragments interface, we came up with triple pattern fragments. In general we just said: okay, what if we restrict the interface to subject-predicate-object triple pattern matches? For those who are not familiar with RDF: subject-predicate-object is basically what RDF data is expressed in, and we simply allow you to put a variable in place of each of these three terms. This becomes very interesting for archives, or basically anybody who wants to publish Linked Data, because we're trying to move some of the complexity to the client. It is, after all, the client who wants the task done, who wants to solve the query, so why should the server do all the work? And especially for archives dealing with a lot of data and a lot of versions of the same data set, it becomes more and more complex, so we're trying to divide the load a bit. So specifically for triple pattern fragments, the selector, in the Linked Data Fragments sense I just described, is a subject-predicate-object pattern. The controls give you the information needed to find other fragments: you can automatically navigate to the next pages, and you can automatically formulate new triple patterns.
For that we use the Hydra vocabulary, because it's important to note that this is a machine interface. We self-describe our API so that a machine is able to figure out automatically how to ask our interface for new triple patterns. For us humans it's easy: we get an HTML page, we know where to click, we know which HTML forms to complete. This is the same thing, but machine-interpretable. So that is the control part. The Hydra vocabulary, by the way, comes out of a W3C community group that works on self-descriptive APIs; very interesting. And then we also supply some metadata, of which the most important part is the estimated counts we provide, so that querying clients can do some optimization: they can do selectivity estimation and decide which triple patterns to ask for first, and so on. But I will clarify this in the demo.

All right. So of course this interface is just an API; we still need a data source that contains all our triples and can easily supply the data for the API. That's where we stumbled upon HDT. HDT stands for Header, Dictionary, Triples. It was a W3C submission a while back and it's still hanging there; it's a research outcome from a group in Madrid. It's actually very interesting, because it's a compressed file format for RDF. It's like a zip file, but not only a zip file: it's also indexed in a way that lets you ask simple queries. So, like I said, you put in an RDF file, it compresses the whole thing, indexes it in certain ways and generates some metadata, and then you can use that as a single file and ask queries against it. The interesting thing for us was that this was like a match made in heaven, because their triple pattern access is really fast and they can also count the number of matches really, really fast. So that's why all the services we run now have an HDT back end that is exposed through a triple pattern fragments API. All right. Back to you.

So as Miel says, this combination of triple pattern fragments, the notion of being able to query subject-predicate-object patterns, and HDT storage truly is a marriage made in heaven, because in the end HDT is just a static file. There are no moving parts to it. It's just a binary thing that sits there, self-contained, with its own indexes on these three dimensions. So again, when we look back at our characteristics: you can have very high availability, because it's just a file that sits there with its access points. Bandwidth is again proportional to the query, and the cost at the publisher's end is very low, so it is very attractive for the publisher. For the consumer, well, medium, in the sense that it may actually have to issue multiple queries in order to resolve what it really wants to do, and I think Miel will demo how that works. But the thing is that rather complex SPARQL queries can really be dealt with in this way, just by using these triple patterns. It's really interesting. So interface expressiveness is clearly better than subject URI only, and, well, not as good as SPARQL, but then again it's affordable for the publisher. So again, we're working on this balance here. LOD cloud integration is of course possible.
It's just URIs. Memento support: absolutely possible. And cross time and data: possible, because multiple data sets can have the same interface, and you can have different versions of the same data set with exactly this interface. And again, Miel will demo that. So from an archival perspective this is extremely attractive in my opinion: very cheap for the publisher and still really valuable for a client.

So that's what we did for our new DBpedia archive. We got rid of the MongoDB solution and built the new setup using HDT files and subject-predicate-object pattern access. This is the storage approach: we take each of the DBpedia dumps and for each we create one of these HDT files. You may remember the first picture with MongoDB, where everything had to go into one data store; we don't have that situation here. Each of these DBpedia dumps gets translated into a binary HDT file, so it's a very scalable proposition. There is software for that; the HDT C++ tools are what we used. Conversion time is on average about four hours, meaning the conversion from the DBpedia dump to the HDT version. Some of these DBpedia versions, the recent ones, are massive, so four hours is actually a really good deal. I remember that when we tried to load the latest version into Mongo, it took us a day and we weren't even successful. As for storage, here we have only 70 gigabytes, which says something about the compression rate. We had 400 gigabytes for 10 versions, and that didn't include the recent ones that are massive. Here we have only 70 gigabytes for basically all the versions up until now, and that is in total 5 billion triples that are stored and accessible in this way.

This is the access side. Once we have these DBpedia HDT files, we introduce the Linked Data Fragments server from Ghent University, which for the purposes of this project was extended with Memento TimeGate functionality. So basically what happens now is that a client goes to DBpedia, finds a TimeGate link there pointing to this Memento TimeGate in the Linked Data Fragments server, and can then start datetime negotiation, but over the subject-predicate-object patterns. That's extremely powerful: it's not only subject URI access anymore, it's these triple pattern queries that you can now resolve across time. Since Miel will demo this, I don't really need to show it to you. In the slides on SlideShare you will find the base URIs of all these things, so you can try this at home. You can use Memento for Chrome to navigate this, or obviously you can do it at the prompt using curl or so. And in addition to providing the subject-predicate-object access, we also still support subject URI access for backwards compatibility; that is basically just a proxy on top of the Linked Data Fragments TimeGate itself. So here the TimeGate functionality is actually implemented directly in the Linked Data Fragments server: if you download the new version of that software, it comes with Memento on board. Two types of access, triple pattern fragments and subject URI, both with datetime, of course. I think that's about it, and then I'm going to hand over to Miel for the little demo.

All right. You know all the stories about live demos, right? This is not like that; it will work. Okay, can everybody see that? So this is fragments.dbpedia.org, the official DBpedia triple pattern fragments interface. This is the home page.
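For the machine view, a fragment is in essence just a URL with subject, predicate and object parameters. A request could look roughly like this; the dataset path and the details of the response are illustrative, and you would normally ask for an RDF serialization rather than HTML:

    # Ask the DBpedia TPF interface for all triples with Paris as subject
    curl -H "Accept: text/turtle" \
      "http://fragments.dbpedia.org/2015/en?subject=http%3A%2F%2Fdbpedia.org%2Fresource%2FParis"

    # Besides the matching triples (paged), the response carries the Hydra
    # controls describing the subject/predicate/object form, plus metadata
    # such as an estimated total count that clients use for query planning.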
So what we typically host at this endpoint is basically the latest version and maybe the previous version, but it's definitely not an archive; it's there to always supply the most recent version of DBpedia. That is of course where the archive comes in. So I will dive right into it. This is what a triple pattern fragments interface looks like, or at least what it looks like in HTML; of course it has RDF content negotiation as well, and that is what the client applications actually use. As you can see, you can do all sorts of browsing using just triple patterns. And here you can see what the metadata looks like: it says, okay, this triple pattern matches two triples, there are 100 triples per page, and there is just one page because there are only two. I can just keep on going like this.

But the interesting part is of course: what can we do with this? We actually built a query client that uses this interface to resolve real SPARQL queries, the kind you would normally send to a server to get back an answer; now it's actually your client that solves the query. So let me click this. All right. You see here DBpedia 2015, which is just the URI that I showed you, and this is a query, for example, that looks for all the movies Brad Pitt played in and the directors of those movies. This client was developed in JavaScript; it just runs in the browser, there's no server-side magic to it. And when I run this, you can see that query times are still fairly fast, depending on the query, of course. But especially for the server this is really easy to answer, because the computational resources are way easier to predict when you just have these triple patterns than when you have to handle the whole expressiveness of SPARQL. And here at the bottom you can see that it requested a lot of these fragments just to be able to come up with the answer. Of course, this client is always under development; there's query execution research going on trying to optimize this process.

So what about the Memento aspect? Let me get back to my original resource, my fragments. I will ask for triples that have something to do with Paris. And now I can use the Memento extension for Chrome; it's fairly small, but you can see it at the top right. There I can set a date, for example January 2010, and see how this page used to look at that time. If I now use the plugin, I can say "get near the saved date" and it will start navigating. What actually happens is the negotiation: it followed the link to the TimeGate and negotiated about the date. Now you can see that I have left fragments.dbpedia.org and I'm at fragments.mementodepot.org. This is a completely different location, but because of Memento you automatically navigate to the right version. I can use this as any normal TPF server; I can just stay in this version if I want. And I can also navigate back: if I say "get near current date", you will see that it brings me to the memento that represents the most recent version.

All right. Now for the magic. I also tried to implement this Memento feature for SPARQL queries, which was a success. So let me demonstrate what this means for executing SPARQL queries. Let me see if I can... All right. So I created this query. It's a SPARQL query. Don't be scared; it's quite easy actually.
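Reconstructed along the lines described next, the query would look roughly like this; the property names are assumptions for illustration, not necessarily the exact ones used in the demo. It is written to a file here so it can be fed to the command-line client afterwards:

    cat > san_antonio.sparql <<'EOF'
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbp:  <http://dbpedia.org/property/>

    SELECT ?albumLabel ?recorded ?artistLabel WHERE {
      # the UNION covers the property as it was named before and
      # after the DBpedia schema change mentioned in the talk
      { ?album dbo:recordedIn <http://dbpedia.org/resource/San_Antonio> . }
      UNION
      { ?album dbp:recorded <http://dbpedia.org/resource/San_Antonio> . }
      ?album rdfs:label ?albumLabel ;
             dbo:releaseDate ?recorded ;
             dbo:artist ?artist .
      ?artist rdfs:label ?artistLabel .
    }
    EOF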
What it does is look for all the albums that were recorded in San Antonio, get the label and the recording date and the artist, and then get the label of that artist. The reason the pattern appears twice is that DBpedia changed its schema somewhere along the way, and if I queried old versions with only the new schema, it wouldn't give me any results, because they don't know that schema. So this is the query. And then I have the same client you saw in the browser, just in a Node.js version. It's the same thing, it just runs on my desktop instead. I wanted to do it in the browser, but I ran into a cross-origin issue and couldn't fix it in time. But so this is how I... Can you see that? It's all the way at the top, right? That's a shame. I'll do this and then I will do this. Okay, so now it's just a command-line client. Here you can see that I'm querying fragments.dbpedia.org, so I'm not going to the archive, I'm just going to that one, with the query file and some output parameters. If I run the query like this, you can see that it comes up with the results; Black Sabbath is in here. And if I run this query again but now add a datetime, 2012, with the same URI, you can see that I get different results, because this is what the data used to give me in 2012. You can actually see that one of these has a recording from November 2012, which is obviously not in the other one, and it's also fun to see that there was data that disappeared over time. Okay, I think that's it for now. Let me get back to the slides.

All right, so as a last thing, the important point is that as long as publishing Linked Data stays hard and too expensive, we won't be able to make this whole Linked Data and metadata story work in the end. The important thing is that you can really try this at home: it's a fairly easy and cheap way to publish Linked Data. So now I will briefly show you the steps for setting up your own interface. Typically you just take your RDF files and you put them into an HDT file using the C++ tool. This is research code; there are still a lot of issues in it. It works, but you have to be careful. We're still looking for people who want to contribute to making this software better, because it really is great. But we noticed that it doesn't work well with RDF data that isn't cleaned in advance, so there is a lot of manual cleaning: making sure the URIs are encoded properly and so on. That's also why some DBpedia versions were not completed in time. And yes, it still needs some work, and the code could give you a lot more feedback than it does now. It's also very memory-intensive: it compresses really well, but in order to do that it currently needs all the information in memory. I know the people behind it are working on a MapReduce implementation that is supposed to solve most of these issues, but again, it's research work, so production quality is still some way off. There's also a Java version for those who want it. But in general it's really easy: it's a single command, rdf2hdt, and it just starts rolling.

Second, download the triple pattern fragments code. Actually, this is more than a triple pattern fragments server; it's a whole Linked Data Fragments server, and we add plugins and modules whenever we can, like for example the Memento support. It's a Node.js server. We also have one in Java, and one in Python, I think, and there are some third-party implementations as well.
But this is definitely the most mature server, so I would advise you to try that one first. We have support for all sorts of data sources. HDT is recommended, but if your data is small you can just put up Turtle files or keep something in memory, or even use a SPARQL endpoint if you want. And for the Java server somebody also added Blazegraph support, which is a high-end triple store, but I haven't tested that yet. Version two was just released, like Herbert said; it has some breaking changes because the Hydra vocabulary changed. And then step three is to configure your server. That is just composing a JSON file saying: these are my data sources, these are the mementos, these are the time ranges they are valid in; add some TimeGates, connect them with the mementos you want them to navigate to, and so on. All of this is described in more detail in the repositories on GitHub, in the READMEs and so on; I won't go into it right now. And then all you need to do is start your server with a single command, and that's basically it. We usually don't need more than half a day or so to get something up. Most of the time goes into generating the HDT; once that's done, it's basically half an hour or something. And it does scale, because there is this project in Amsterdam that hosts something like 640 billion triples, and they do it all in this way, which I haven't seen any SPARQL endpoint do yet. So that's it for us. I think we have time for questions, right?

So that's done by the query algorithm: it figures out how to split the complex query into simpler queries, and it uses the count metadata I briefly mentioned to optimize this process. We're also experimenting with which metadata we can still add and which changes we can make to make queries go faster.

Not the server, the client does this. All the server does is answer triple patterns; it doesn't do anything else.

We try to. In theory it handles things like OPTIONAL and DISTINCT, well, not DISTINCT... well, DISTINCT as well. Some things are just hard and go really, really slow; they work, but they go really, really slow. But then again, it's strange that most of these things are in SPARQL in the first place, because they're so specific and so database-like; I don't know how they were planning to make those work on the web. Yeah. Like counting, for example. Yes.

So there are two servers involved in that. First of all the Node server, because that's the one that returns the data, and then there's an HTTP cache in front: an NGINX server that is basically able to reuse most of the fragments, especially when you have multiple clients querying. Because of the granularity, this is better for caching in general than SPARQL queries, which are always different. Yeah, in general all requests are really small, so it's not that big of an issue, but we're still looking at getting the bandwidth down more and more.

Yes. Some people have voluntarily started doing things in Python and in Perl, but ours is Node.js. It's freely available under the MIT license, and you can compile it for the browser, which is what you saw. Technically you don't really need to have it yourself: you can just go to client.linkeddatafragments.org, which is what I demoed. You open up the browser and you can start querying interfaces from there as well. I mean, they're totally separate.
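For anyone who wants to try the setup at home, the steps above boil down to roughly the following; the file names, port and config keys are illustrative, so check the HDT and Linked Data Fragments server READMEs for the exact details:

    # 1. Compress and index an RDF dump into a single HDT file (hdt-cpp)
    rdf2hdt dbpedia_2015.nt dbpedia_2015.hdt

    # 2. Install the Node.js Linked Data Fragments server and its client
    npm install -g ldf-server ldf-client

    # 3. Describe the data source in a JSON config (keys shown are indicative)
    cat > config.json <<'EOF'
    {
      "title": "My DBpedia archive",
      "datasources": {
        "dbpedia-2015": {
          "title": "DBpedia 2015",
          "type": "HdtDatasource",
          "settings": { "file": "dbpedia_2015.hdt" }
        }
      }
    }
    EOF

    # 4. Start the server (config file, port, number of worker processes)
    ldf-server config.json 3000 4

    # 5. Run a SPARQL query against the interface with the command-line client
    ldf-client http://localhost:3000/dbpedia-2015 san_antonio.sparql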