Hi everybody, welcome. I'm very excited to have you all here at this SPARQL workshop. I want to say a few words about how this works and why we're doing it, introduce our guests, and give you a sense of the agenda and what's going to happen in the next two hours. Can people see my screen? Nope — let me fix that. All right.

So, why this workshop? Stas and I talked a while ago, and we realized there is a big gap at the moment, both social and technical, between Wikidata and the rest of our projects. This is something I want to address personally: give people the opportunity to understand what the deal is, not just with Wikidata per se, but also with the APIs that make it possible to retrieve data from Wikidata. So I see this primarily as a space for analysts, scientists, product managers, and engineers to learn about these APIs, learn a little bit about the syntax used by the query service, see what's possible with it, and hopefully also clear up some myths around Wikidata. The one thing I wanted to start from is that I often hear that Wikidata is an insular project out there, built by people who speak German, and sure, it's a cool thing, it's growing, but we don't quite know what it does — it's mostly a place for GLAM people or open data people to dump data into, or maybe to connect articles via inter-language links, but that's pretty much it. I think that's a big misunderstanding of what Wikidata is and what it represents, and I wanted 30 seconds to show you one query that, to me, represents the future of Wikipedia and the Wikimedia projects.

It's the curious case of Francesco I Gattilusio. This is someone who was born in Genoa in the 14th century, a notable Italian individual according to the Dizionario Biografico degli Italiani, the national biographical dictionary and the most authoritative source on notable individuals in Italy. If you check his Wikidata entry, you'll see there is quite a lot of information about him, and it turns out he has an entry in nine language editions of Wikipedia — except Italian. And I thought, well, that's interesting: it's such low-hanging fruit that there should be an Italian Wikipedia article about this person, who is notable in this biographical dictionary and notable in nine other editions. How come it doesn't exist? There's a query that lets you see how many Francesco Gattilusios we have, and it turns out we have over a thousand — and that's just for whatever currently exists in Wikidata, based on the matching with this dictionary. To me, this is an example of how, right now, we think of Wikidata as something created after the Wikipedia article, to add some additional data. Three or four years from now, I expect contributions on Wikipedia will start from Wikidata, as the backbone of the entities that need to be created and expanded, and we'll see free-form text flow from structured data instead of the other way around. I just wanted to show this example to give you a sense of why I think this is going to be a big thing in the next couple of years.
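A query of this shape can be sketched in SPARQL. This is a reconstruction, not the exact query from the talk: it assumes P1986 is the Wikidata property for Dizionario Biografico degli Italiani identifiers, and it uses the query service's standard sitelink modeling:

```sparql
# People with an entry in the Dizionario Biografico degli Italiani (P1986, assumed)
# but no article on Italian Wikipedia.
SELECT ?person ?personLabel WHERE {
  ?person wdt:P1986 ?dbiId .                                # has a dictionary entry
  FILTER NOT EXISTS {
    ?article schema:about ?person ;
             schema:isPartOf <https://it.wikipedia.org/> .  # no itwiki sitelink
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```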
But I don't want to take too much time, so let me go over our agenda and, first off, introduce our guests. I'm very glad to have with us today people who have been using SPARQL and Wikidata for a while, for fun and for business. I'd like to introduce Ruben from Ghent University, whom I had the pleasure to meet at a conference a few weeks ago, where he gave a very compelling talk on SPARQL endpoints and federated queries. So I'm very happy to have him here today. Ruben, are you there? Can you say hi?

Absolutely, thanks for having me here. Welcome everybody, and I'm really excited to talk about SPARQL today.

Fantastic. And it's 9 p.m. where Ruben is, so extra bonus points for being with us at such a late hour. We also have Benjamin Good and Tim from the Scripps Research Institute, and from Gene Wiki. Gene Wiki, for those of you who don't know about it, is an amazing project that used to be built on top of Wikipedia. Now this group is working with Wikidata directly, to store facts extracted from the literature in the area of bioinformatics — genomic research, more specifically — and to make sure this data is available and represented on Wikidata and can be used by other scientific communities as well as our own contributors and readers. I think it's a fantastic project, one of my favorite examples of expert contribution to Wikidata, and Ben and Tim are going to give us an overview of how they're using Wikidata and SPARQL in their own projects. Are you guys there? Can you see us?

Yep.

All right, fantastic. And finally, we have Lucas, who is going to be helping as a facilitator today. Lucas runs WikidataFacts, a tremendously useful Twitter handle that showcases many types of SPARQL queries and helps people understand SPARQL and Wikidata one query at a time. So again, very happy to have you too, Lucas, to help with the rest of the workshop. If you're there, say hi.

Very glad to be here. Hi.

Hi, welcome. And with that, a few practical notes about the structure of the workshop. We're going to start with two talks, by Stas and Ruben, each roughly 25 minutes plus five minutes for Q&A. After that we'll have a short break of 10 minutes, then continue with a presentation by Tim and Ben, and we'll have the remaining 45 minutes for hacking: playing with examples on the Wikidata query service and answering any questions people may have around SPARQL. We'll be recording the presentations, so be aware that if you don't want to be recorded, you should not join the BlueJeans call, and you should mute yourself. We will not be recording the second part, so we'll have a safe space for learning, hacking, and asking questions. There's also an IRC channel that Nick will be hosting, so if something comes up during the presentations that you want to discuss, please post your comments on IRC and we'll relay them to the speakers. A final note: the microphones in the room don't feed into the recording, so if you have something, I'll relay your questions to the speakers myself. And with that, I'm not going to take any more time. Stas, the stage is yours.

Hello — can you hear me well?

Yes.

Okay. Before the presentation, two short points. First, I'm sorry: I planned to be there in person to help with the later Q&A and hacking session, but due to an injury I'm not able to. So please ask any questions you have on IRC, or later by email or other channels. Second, there is a lot of material to cover and not a lot of time, so I'll be glossing over and rushing through several things. Please ask later if anything is unclear or if you want to know more, and there will be a list of literature at the end.
Please refer to that, too. The first thing I want to start with is Wikidata — a very short refresher on what Wikidata is. Wikidata is a free, structured knowledge base where the data is represented in many languages. No language has preferential status, unlike, for example, English Wikipedia, where English is the language; we'll see how important that is later. And it's under a free license, so it's designed for the data to be reused and mixed and matched and so on. It's at around 20 million entities now, so it's a pretty big database.

Let's see an example of how the data looks. This is the data about the city of London. Here we see, first of all, the multilingual aspect: we have a lot of names. And a very interesting thing is that in the top left corner you see Q84. That's the true name of the item in Wikidata, the name it is known by to all the data. It's not London, it's Q84 — London is just an English name, and it's not any more special or preferred than the Russian name or the Spanish name or the Hebrew name. It's just one of the strings that belong to this item. When we do SPARQL queries, we'll actually be dealing a lot with these Q things instead of labels, and we'll see how to deal with them without going crazy.

The next thing we see is the statements. All the data except for labels is organized in statements, and a statement has an internal structure. It consists of a property — that's what we are talking about; for London it might be population, or mayor, or when it was founded: the kind of thing we're talking about. It has a value, which is the actual data that we have. And it can have two additional things. It can have a qualifier, which adds extra information: for population, it might be when it was measured or how we know it; for mayor, it might be the term that he or she served as mayor, and so on — details pertinent to this piece of information. And there is also a reference, which says where we know this from: it might be from Wikipedia, from some journal, from some URL, from some other encyclopedia — a lot of things. All together, this is called a statement. So this is how the data lives on Wikidata.

The next thing is how to turn this data into knowledge. What I mean by that is how we can use this data beyond stating the mere facts — how we can make inferences and learn facts that are not stated directly in Wikidata. The famous question we were asked, the one that started the whole query service effort, is: what are the biggest cities with female mayors? Getting this answer manually, from just the data present on Wikidata, is kind of hard: you'd have to go through all the big cities, check the mayors, check which of them are female, make a list, sort it, and so on. That's a lot of work. So we want an engine to give us answers to such questions, and we will see how we do that. But we'll start with representing the data in a format that is amenable to such queries, and this format is RDF.

RDF is short for Resource Description Framework, and it is a very simple way of representing knowledge. Each RDF data item is called a triple, and it consists of three things: subject, predicate, and object. The subject is what we are talking about — for example, London.
The predicate is what we are expressing, like population, and the object is the actual content of this knowledge — what we are saying about it. So: London's population is eight million and change. London is the subject, population is the predicate, and eight million is the object. And there are other statements we can make; basically, a lot of knowledge can be represented this way. Another thing you may notice is that this structure is very similar to how graphs look. We can represent any knowledge like this as a graph, and conversely, any directed graph can be represented as RDF. We'll see how this is relevant to Wikidata.

Okay, first: how do we actually write this down? As I said, RDF is a very abstract concept — it's just saying three things — but in a computer we have to write it down, and there are a lot of ways to do that. You can write it as XML, as JSON, as S-expressions — basically anything you like. But there are two formats that matter for this presentation. The first is N-Triples. It's a very simple format: you just write the three things you're talking about, one after another, space-separated, and put a dot at the end — subject, predicate, object, one triple per line. This format is very simple and, being line-based, can be processed by a lot of tools. The downside is that it's very verbose: actual data in this format is kind of hard to read, because each triple contains all the information, including long URLs and so on. So there is a shortcut format called Turtle, which lets you write the same data in a more human-friendly way. It lets you use a shorter form for writing the URLs, and it also lets you avoid repeating subjects and predicates; we'll see how that works in a moment. The important thing to know is that Turtle is the format we'll be using when discussing SPARQL — this is how triples are expressed in SPARQL — and it is one of the ways RDF can be serialized.

So now let's see how we actually represent Wikidata. Wikidata, as we saw a bit earlier, is kind of complex: you have statements, you have qualifiers, you have references, and so on. This is the graph that shows how we represent it. You don't need to remember it — just glance at it and note the level of complexity. Don't expect yourself to remember all of it; you'll probably need to refer to the documentation from time to time for some aspects. But it's not super complicated: it's about a dozen things you need to remember.

So let's see the actual representation of an item. First of all, we see it's in Turtle format. Again we have a subject, a predicate, and an object, and we have a semicolon, which says we're still talking about the same subject — that's one of the conveniences the Turtle format gives us. There are several kinds of things we can write down in this format. The most common is a node, or URI — almost the same as a URL. In Turtle it usually consists of a prefix and a suffix separated by a colon, and the prefixes are standardized; we have a bunch of standard prefixes. For example, wd: is the standard Wikidata prefix for all the items.
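To make the two serializations concrete, here is the same statement about London's population written both ways (the population figure is just an illustration):

```turtle
# N-Triples: one full triple per line, full URIs, terminated by a dot.
<http://www.wikidata.org/entity/Q84> <http://www.wikidata.org/prop/direct/P1082> "8673713"^^<http://www.w3.org/2001/XMLSchema#decimal> .

# Turtle: prefixes shorten the URIs, and a semicolon repeats the subject.
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

wd:Q84 wdt:P1082 "8673713"^^xsd:decimal ;   # population (P1082)
       wdt:P17   wd:Q145 .                  # country (P17): United Kingdom (Q145)
```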
And wikibase: is a standard prefix for the things that relate to the Wikidata model itself, like items. We also have literals, which can be strings or numbers, or can be typed — for example, dates. Strings can carry a language, and we can have typed values like dateTime. We can also have blank nodes, but we won't go there, because we don't have time. So this is basically what a typical Turtle dataset looks like, and this is also what SPARQL query data will look like when we get to it later.

So now we get to SPARQL. SPARQL is the language for querying RDF data. The name is a recursive acronym: SPARQL Protocol and RDF Query Language. What it is: it's a query language, it's declarative, and it's SQL-like. If you know SQL, you have a rough idea of what SPARQL will look like — maybe not how it works, but at least it will look familiar. If you don't know what declarative means: unlike in languages like C or PHP, you don't tell the language processor what you want it to do; you only tell it what you want to get. You describe the data you want, and it gets that data for you from the description. In SPARQL, this description is composed of triple patterns — we'll see them in a moment — plus filters and various modifiers and transformations on those triple patterns. The triple patterns are expressed in the Turtle syntax we just saw, and the query produces values out of them, which are the results.

So let's look at some queries. Here's a simple query that finds all the cats on Wikidata. The first thing we have, on the first line after WHERE, is the triple pattern. In this triple pattern we have three things. The green things are fixed: wd: marks an item and wdt: a property — we'll get to how we know what they mean later; for now, just believe me that Q146 is 'cat' and P31 is 'instance of'. And then we have this ?item thing. ?item is a variable, and it's customary in SPARQL to write variables with a question mark in front. So what this pattern means is that we want all triples whose last two elements are those fixed ones, while the first can be whatever is there. It's a match that returns the series of elements matching this pattern.

Another thing we have in this query is the SELECT and the projection variables. This looks a lot like SQL, and that's not a coincidence — it works a lot like SQL, too. A SELECT query is for retrieving things; there are other kinds of queries in SPARQL, but we won't be discussing them today. After SELECT come the variables, and then you have WHERE; a SELECT almost always ends in a WHERE, and today at least we won't consider any other possibilities. And the last thing we have is a SERVICE. We'll talk a bit later about what services are; for now, it's a magic box that produces things we cannot get by SPARQL triple matching alone. This specific magic box, called the label service, produces labels: instead of those Q things, we'll see actual names. And "en" means the labels will be in English. So this query produces the list of items and their names — the cats in Wikidata.
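That query is short enough to quote in full; this is essentially the standard cats example from the query service:

```sparql
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .    # ?item is an instance of (P31) cat (Q146)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```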
So let's go to a more complex query: instead of cats, a list of billionaires on Wikidata. The first thing we see after WHERE is again a triple pattern. You can see that there are now two variables in the triple pattern; in general, all three positions can be variables, but then we'd basically be saying 'give me all the data', which is rarely useful. Usually one or two positions in a triple pattern are fixed and the rest are variables. The new thing here is FILTER. A filter takes the triples we match and keeps only the ones that satisfy a certain condition — here, the condition that the net worth of the selected individual is more than one billion dollars. Another new thing is BIND, which lets you create new variables from expressions, for use later; now we have a variable that expresses how many billions that person has. We use the label service again, and one more new thing: we can order results, so we get the richest person first. Again, it looks a lot like SQL, and that's not a coincidence — it's designed to remind you of it. And in the SELECT we again see the variables we've produced.

So now we're mostly ready to see the query we started our discussion with. This is the actual query that selects the top 10 biggest cities with female mayors. Again we see a number of triple patterns. This is an important point: before, we saw only one triple pattern, but they can be combined, and the way it works is that the result has to satisfy all of them. If a variable is mentioned in several patterns, its value has to be the same in all of them — so it's a bit like a join in SQL, but expressed as a set of patterns. In the first line we have something called a path expression. I won't go too deep into that, because it's a bit complex and we don't have time, but basically, instead of matching one triple, it lets you specify a path in the graph — we saw before that triples form a graph — and you state the beginning and the end of the path you want to match.

And we have another filter: FILTER NOT EXISTS. This is a negative filter; it says that something must not be matched. Why do we need it? If you read the comments: we find the cities and their heads of government, and then we establish that the head of government is female. Then we want to say that this head of government should be the actual, current head of government, not a past one, because we only want to count current mayors. We could have omitted that and had a query that selects all mayors, past and present, but for this query we want only the present ones, so we put a negative filter on the end-time qualifier of the statement; this way, only current mayors are included. Then we select the population, order by population, and we have another clause familiar from SQL: LIMIT. We can limit how many results we get, and this is a very useful clause whenever a query can potentially return a lot of results, because if you don't put a limit on it, it might time out. So if you have a complex query that potentially returns a lot of results, do use LIMIT — otherwise you'll get a timeout.
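Reconstructed, the mayors query looks roughly like this (it follows the well-known query service example: Q515 is 'city', P6 'head of government', P21 'sex or gender', Q6581072 'female', and pq:P582 the end-time qualifier):

```sparql
SELECT ?city ?cityLabel ?mayor ?mayorLabel ?population WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 .       # a city, via a path over 'subclass of'
  ?city p:P6 ?statement .                 # head-of-government statement
  ?statement ps:P6 ?mayor .
  ?mayor wdt:P21 wd:Q6581072 .            # the mayor is female
  FILTER NOT EXISTS { ?statement pq:P582 ?end . }  # no end date: current mayors only
  ?city wdt:P1082 ?population .           # population (P1082)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?population)
LIMIT 10
```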
So now let's open up this magic box of services a bit. A service is a piece of code — in this implementation, Java code — that does various magical things that are not available directly in SPARQL. Currently, in this implementation, we have a number of services. The most important is the label service, which produces the labels and descriptions of items. The way it works: you put the SERVICE clause in the query, you select an item, and then you ask for a variable with the label, description, or alias attached to it; the service will produce that label, description, or alias if it's there. And you can give it a list of languages: say, 'I want the Spanish label; if there's no Spanish, give me French; if there's no French, give me English.' If everything fails, you get the boring Q-number instead — but at least you get something you can display.

The next service is a very interesting one: the 'around' search. What you get is all the things around a certain location. You give it a center — in this case, the coordinates of Berlin — and a radius in kilometers, and you can also get the distance back and, for example, order by it. This query, for instance, finds airports within 100 kilometers of Berlin: you get everything within 100 kilometers, check that it's an airport, and sort by distance. So this is a useful service for map searches. There's also a geographical search within a box: that query gets everything in the box between San Jose and San Francisco, for example.

Okay, now we get to how you actually work with queries. So far I showed you queries that were pre-made, but how do you do this yourself? You use the query GUI at query.wikidata.org. This is the GUI. First of all, if you're not sure what to do, click on Examples — there are a lot of nice example queries. And if we go to the cat query: I promised to show you how to deal with these Q items. What you do is just hover over one, and it tells you what it is. And what if you want to write one yourself? You press Ctrl-Space, and say you want a dog instead of a cat: you just select it, and it substitutes it for you automatically.

Okay, back to the presentation. The next thing I wanted to show is what else we can do with the GUI. Let's continue with our cats and see what the internet is really for — cat pictures. If the items have images, we can show the images right in the GUI. We also have other visualization modes. We have maps, for example: this is a map of the Berlin U-Bahn stations — you see a nice network — so we can put items on a map. Next, we can build graphs that present the data. This, for example, is the tree of Genghis Khan's ancestry — once it actually builds, there we go; let me just enlarge it. So where is Genghis himself? Yeah, here he is. You can see all the ancestors and descendants of Genghis Khan. So you can build these nice visualizations with it.
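Going back to the 'around' service for a moment, the Berlin airports query might be sketched like this (Q64 is Berlin, P625 the coordinate property, Q1248784 'airport'; the service parameters follow the query service documentation):

```sparql
SELECT ?airport ?airportLabel ?dist WHERE {
  wd:Q64 wdt:P625 ?berlinLoc .                     # coordinates of Berlin
  SERVICE wikibase:around {
    ?airport wdt:P625 ?location .
    bd:serviceParam wikibase:center ?berlinLoc .   # search around Berlin
    bd:serviceParam wikibase:radius "100" .        # radius in kilometers
    bd:serviceParam wikibase:distance ?dist .      # bind the distance
  }
  ?airport wdt:P31/wdt:P279* wd:Q1248784 .         # instance of (a subclass of) airport
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?dist
```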
What else can you do? If you have data that is ranked, you can also build bubble charts. This, for example, is a bubble chart of the causes of death in Wikidata. And we can also build timelines. I don't have time to show all of them, but click on Display and you'll see the list of what's available — experiment with those when you have time. One thing I want to point out on this view: if you look really closely here, there is another service I haven't talked about because I don't have time, but do check into it — there are more services available.

The next thing I wanted to show — this is a project in progress, and probably the last thing I'm showing today — is that you can also build your own charts. Here's a query that asks: how many musicians died at each age? Let's say we want to make a chart from it. We click 'graph it' and get this Polestar interface that lets you build your own charts. Say you want lines instead; then you click Export, you get this chart, and you can go to a Wikipedia page and put the graph on it. Let's see — and there's the graph. And this graph is completely data-driven: if you look at the source, there's a query inside. So if the data changes, the graph changes, and you don't have to do much beyond writing the query, building the chart with the drop-downs, and copy-pasting it. You can take this further: with Yuri's help, we have templates you can use. For example, there's a template for London population history. You see the Q-ID — remember the Q for London — so if you replace it with the ID of any other city, you get this nice chart for any city you like. And you can also combine this with maps: this, for example, is a data-driven map of a country, including the capitals of all the states, with populations and images. Again, this is all configurable by country — we have the IDs here — and it's generated from Wikidata through these services; it combines two services into one visualization.

That concludes the demo part, and I'm basically done. I could say 'now you're on your own' — except you're not. There are a lot of places to get help. There's a list of links in this presentation; when the slides are published, you can use them to learn about SPARQL and our implementation of it. We also have a lot of community resources and a bunch of tools. I want to mention one specifically: a converter from the old WikidataQuery syntax to SPARQL — use it if you're familiar with the old one. The next two are for finding properties and classes. And this is the list of community venues where you can go to discuss queries, ask about queries, look at examples, and talk about everything related to this. With that, I'm concluding this presentation. Please contact us on the mailing list or on IRC if you have any questions or need any help.
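For reference, the age-at-death query from the charting demo might look roughly like this (a sketch, not the exact query shown: P106 'occupation', Q639669 'musician', P569/P570 dates of birth and death):

```sparql
SELECT ?age (COUNT(?person) AS ?count) WHERE {
  ?person wdt:P106 wd:Q639669 ;    # occupation: musician
          wdt:P569 ?born ;         # date of birth
          wdt:P570 ?died .         # date of death
  BIND(YEAR(?died) - YEAR(?born) AS ?age)
}
GROUP BY ?age
ORDER BY ?age
```

Pasting something like this into query.wikidata.org and picking a chart view gives the kind of histogram shown in the demo.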
Fantastic — let's give Stas a round of applause. The Vega export is really cool; I hadn't seen that before. Okay, and with that we're going to move on to our next speaker. That's Ruben, I believe — Ruben, you should be up next and connected. There we go. Is everybody seeing 'The future is federated' on screen now?

Yes, we are.

Okay, very good. So, as Dario said, this is a slightly shorter version of a talk I gave at VIVO, and it's about how we can execute SPARQL queries not against one data source, but against multiple data sources — and do this live on the web. Why is that? Well, knowledge is inherently distributed, if you think about it. Knowledge is never going to be in one central place; there will always be multiple sources of truth, and even Wikidata will never be the single place where everything is stored. And that's actually a good thing, because this is how human knowledge works. Human knowledge is also inherently heterogeneous: there are different kinds of knowledge in different types of formats, and so on. But fortunately, on the web, knowledge is also inherently linked, connected in various ways — this is, of course, the essence of linked data. And if you want to query such knowledge — inherently distributed, but still linked — there are important questions to answer, such as: where do we find the data we need? How can we access it? And how do we integrate it together? What I'll show today, with a demo, is that it is possible to integrate multiple data sources live on the web — but we need to set our expectations right. Federation cannot solve all problems, just like centralization cannot solve all problems and cannot hold all the data in the world. So if you can agree that we won't be doing actual magic here, but still pretty cool stuff, then I think we're in the right place to continue.

Today I'll be talking about three things. First, a bit about SPARQL and RDF. I know you just had the introduction, so I won't dwell on that, but there are still a couple of important things to say about SPARQL that haven't been said yet. Next, I'll talk about a lightweight interface to RDF on the web. The Wikidata endpoint is a very heavyweight endpoint, which means it can do all the nice things we've just seen in the previous talk; but for some use cases, lightweight interfaces are preferable, and I'll explain why and how they work. And in the last part, I'll show you a demonstration of how we can query multiple sources live on the web.

So first, a tiny bit more about SPARQL and RDF. As you've just seen, RDF is the data language we use for the semantic web and linked data, and the basic unit of RDF data is the triple, which consists of three parts: a subject, a predicate, and an object. It's really that simple. And Wikidata has also been made available as RDF data. Then there's SPARQL — but SPARQL is actually two things. On the one hand, as we saw in the previous presentation, SPARQL is a query language, which you can use to ask questions of RDF data stores. But at the same time, SPARQL is also a protocol for executing such SPARQL queries over the web. In this presentation I'll be criticizing SPARQL a lot — just know that I'm not criticizing SPARQL the language; I'm criticizing SPARQL the protocol. I think the language is a wonderful idea; I'm just not sure the protocol is also a wonderful idea. So what exactly is the SPARQL protocol? Well, in this interaction you always have a client, and the client uses the SPARQL protocol to send a SPARQL query to a SPARQL endpoint. The thing in pink there is a query, expressed in the SPARQL language.
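Concretely — as the next part spells out — the protocol amounts to little more than sending the query text in an HTTP request, along these lines (a sketch following the SPARQL 1.1 Protocol; the endpoint URL is illustrative, and the query is the URL-encoded form of `SELECT * WHERE { ?s ?p ?o } LIMIT 10`):

```
GET /sparql?query=SELECT%20%2A%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010 HTTP/1.1
Host: example.org
Accept: application/sparql-results+json
```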
And a SPARQL endpoint is basically a server that says: you can ask me any SPARQL query. The SPARQL protocol is simply a set of agreements on how the client should send a SPARQL query over the web. That's it — the SPARQL protocol is sending your SPARQL queries to some server on the web. This means you can say things like: hey, SPARQL endpoint, I have a question for you — I want to know which artists were born in San Francisco. And the SPARQL endpoint will reply: no problem, sure, here you go, here's a list of artists. The thing is, you can also say: hey, SPARQL endpoint, I have this really complicated question — and by the way, this is an actual existing query — and the SPARQL endpoint will say: sure, no problem, here's the answer. I think you can see where I'm going here. SPARQL endpoints are very nice — in fact, a bit too nice. They'll do everything you ask of them, and as you can imagine, that makes a SPARQL endpoint very expensive to host.

So the question is: can everyone afford a SPARQL endpoint? Well, for Wikidata this works: you have a powerful SPARQL endpoint and the funds to keep it up, which is great. But there are lots of smaller organizations, people, and institutions publishing data who don't have the money for a SPARQL endpoint — because having a website is very cheap, but a SPARQL endpoint is about the most expensive API that exists on the web. As a result, there are two problems with SPARQL endpoints. First, if you look at the web, we don't have a lot of them, precisely because they are so expensive to host. So again, congratulations, Wikidata, on your endpoint — I'm really happy you have it — but for most organizations it's unfortunately not feasible. The second problem is that, of the endpoints that do exist, the average endpoint is down for one and a half days each month. And this is quite a disaster, because it means that if I build an application on top of that SPARQL endpoint, my application is going to be down for at least one and a half days each month. And if I build an application on top of multiple SPARQL endpoints — say just three — then in the worst case my application is down for four and a half days each month. In other words, the application just won't be reliable. And why is this such a problem? Because it means the whole vision of the semantic web — of having applications on the web with live data — is just not working. So let's see what we can do to fix that.

This brings me to the second part of the talk, in which I want to explain what we've built: a more lightweight RDF interface, which is less expensive to host — but, of course, this comes at a price, which I'll explain next. When I'm designing interfaces, I'm not primarily thinking about machines or computers; I always first ask the question: what would the average human do? A little background about myself: I'm a researcher, and my research focuses on building intelligent clients. So when I'm designing interfaces for those intelligent, automated clients, I'm still thinking about people. If I had to solve this as a person, how would I approach it? That's my inspiration for designing for machines. So, for instance, how would the average human solve the question we saw before?
The SPARQL query says: give me things that are artists, give me the names of those things, and those things should have birthplace San Francisco. In other words: give me the names of artists born in San Francisco. Of course, if you give this to an average human, they'll panic, because they haven't seen SPARQL — unless, that is, they've seen the tutorial we just had. So let's imagine an average human gets the question: which artists were born in San Francisco? How would they answer this question if all they had was Wikipedia? Well, if I were that average human, I would just go to the page about San Francisco on Wikipedia. Then I would make a list of all the people born there in San Francisco. And then, for each of the people on the list, I would check their Wikipedia page to see whether they're an artist. Sounds like a great plan — except the second step isn't entirely realistic, because how can we be sure that the San Francisco page has a list of all the people born there? Maybe there's a person born in San Francisco who isn't on that page. So this method has its limits. And this means that if we want a person to be able to do this, we need to empower them — give the average human just a little something extra to make sure they can answer a simple question like this one. But I'm not going to give them a SPARQL endpoint, because those are really, really, really expensive to keep up. I'll give them something else.

And then the question becomes: what is the simplest complexity? Given that SPARQL is really expensive, and given that simple hyperlinks like on Wikipedia are insufficient, what is the simplest thing I can offer that still enables a human to answer this question? To find a solution, I went back to the essence of RDF. And the essence of RDF, as you know by now, is triples: everything in RDF is a triple — a subject, a predicate, an object. Now, the essence of linked data is that you can browse things by subject: you can go to the page of a certain person on Wikidata or Wikipedia and get all the information about them. So if I want to know everything about San Francisco, I just go to the San Francisco page on Wikidata or Wikipedia. What we propose is an interface called TPF, and TPF does this one extra thing: you can choose not only the subject, but any of the three components. So you can say: no, no, I want to know the things that have San Francisco in the object position, not just the subject position. In other words, I want to see the things that link to San Francisco, not just the things San Francisco links to.

So this is the interface we propose: you can ask questions that consist of a triple pattern. Remember, in the previous presentation, the basic building block of SPARQL queries was the triple pattern — well, our interface serves just triple patterns. And that's what TPF means: it stands for Triple Pattern Fragments. Our lightweight interface offers access to a dataset in parts, in fragments, which you select with a triple pattern. This means clients can only ask the server for triple patterns. They cannot say 'here is my complex SPARQL query' — that's not possible. They can only ask for a single triple pattern at a time. So let's see how an average human would answer the question 'which artists were born in San Francisco?' if they were given a TPF interface to DBpedia. Maybe a small intermezzo here: what is DBpedia?
Well, DBpedia is similar to Wikidata, but the difference is that Wikidata is manually created, whereas DBpedia is automatically derived from Wikipedia. So DBpedia is, in a sense, the predecessor of Wikidata. Why did I choose DBpedia instead of Wikidata? Because DBpedia is a little easier to explain. Everything I'm saying right now would also work with Wikidata; it's just that DBpedia's data model is slightly simpler, which makes it easier for me to explain. But everything I say applies to Wikidata as well.

So: our human gets this question and can only use a TPF interface to DBpedia, which is kind of like Wikipedia in RDF. What would the average human do? Well, if I were doing it, I would say: first, give me all the things where the predicate is birthPlace and the object is San Francisco — in other words, get me the list of things born in San Francisco. This time it works, because we have this extra mechanism. Remember, we couldn't do this on regular Wikipedia, where we could only ask for the things San Francisco links to; but here, thanks to the extra capability in the interface, we can also ask the opposite: give me the things that have San Francisco as their birthplace. Once I have this list, I go to the next step: for each of the people on the list, I check whether they are an artist. Say the list has 500 people born in San Francisco; for each of the 500 I check: is this person an artist? That leaves me with a shorter list — say 100 artists, for instance. And then, for each of those people, I can say: now give me their full name. And this is how an average human would do it, given a TPF interface to DBpedia. And guess what: an average machine would do the exact same steps. Now you might ask: how does the machine know which questions to ask the server? Well, the answer is already in the SPARQL query — those three patterns are exactly the patterns of the SPARQL query.

So let me recap. If you have a SPARQL endpoint on the server side, you take the entire SPARQL query and send it to the server. This is very easy for the client, but very expensive for the server if lots of clients do it. What we propose instead is that the client splits this complex SPARQL query into pieces and sends those to the server. This might sound a little abstract, so let me show it to you very concretely, with a live demo. What you're seeing right now is a browser window. I'm going to client.linkeddatafragments... no, first, sorry — first I'm going to show you the data interface. This is a TPF interface on top of DBpedia. I can indeed say things like: give me all triples that have San Francisco as the object — and here's the whole list. I can also say: give me all triples with a particular person as the subject — and there they are. So I can ask any triple pattern query — nothing more complex, just those. If we now go to client.linkeddatafragments.org, we get an in-browser client, written in JavaScript, that can answer these queries from the browser. I have the same query here: give me artists born in San Francisco, I want their names. And I added an extra filter saying I only want the English names — don't give me the Japanese or Chinese names.
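The query in the demo is presumably along these lines (a reconstruction; dbo:, dbr:, and foaf: are the usual DBpedia and FOAF prefixes), with comments showing how the client decomposes it into the three triple-pattern requests just described:

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name WHERE {
  ?person dbo:birthPlace dbr:San_Francisco .   # TPF request 1: who was born in San Francisco?
  ?person a dbo:Artist .                       # TPF request 2, per person: is this an artist?
  ?person foaf:name ?name .                    # TPF request 3, per artist: what is their name?
  FILTER langMatches(lang(?name), "EN")        # keep only the English names
}
```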
So when I click 'execute query', what happens is that my client, right here in the browser, decomposes the complex SPARQL query into triple patterns, sends those to the server — as you can see right here — and displays the results in a streaming way. Let me do that again: when I click 'execute query', the client splits the SPARQL query into small fragments, and the results come streaming in here. Now, the first thing you might notice is that this is actually a little slower than a SPARQL endpoint: this takes a couple of seconds, whereas a SPARQL endpoint would answer immediately. Yes, that's true — but with the SPARQL endpoint you'd have a downtime of one and a half days each month, if you're unlucky. This system is much cheaper for the server, so it's much less likely to go down. So yes, it takes a couple of seconds, but it's the same couple of seconds today, tomorrow, tonight, next week. This interface simply does not go down, which to me is much more important than having a server that is fast. Of course, it all depends on your constraints, but for us this was one of the main points, and I'll explain why in just a second.

The final part I want to talk about is querying multiple interfaces at the same time, because what you saw previously were queries against Wikidata, which is one endpoint. If we try to do this with multiple endpoints, the problems accumulate: one endpoint is down one and a half days a month, so with two endpoints you might be down for three days, and with three endpoints it's just not worth getting started. The good thing is that federated queries are totally native to TPF clients. It's really simple: instead of asking your questions of one server, you ask the same questions of several servers. It's that easy. So let me show you how this works — a SPARQL query over multiple sources at the same time. What I have here is a more complex SPARQL query, and I'm using three data sources: DBpedia, which we had earlier; VIAF, which is a database about authors and their works; and the Harvard dataset, which is a dataset of the Harvard Library. And my question is: I'm standing here in front of the Harvard Library, and I want to read books written by people who were born in San Francisco. If you look at the SPARQL query, you don't see anything special — it's the same kind of thing: people born in San Francisco, I want their names, I want the titles of the books. But when I click 'execute query', the client uses those three data sources to answer the question. And that's in fact what we're seeing right here: you see multiple sources being consulted, and the results come streaming in at a good rate. Why am I using multiple sources? Because none of those sources individually has the answer. DBpedia doesn't know what books Harvard has; VIAF knows about authors and their works, but it doesn't know where those authors were born. Thanks to the combination of the three, I can get the answer to this complex query. And you'll notice the results, again, come in streaming, live in your browser. If you tried this with SPARQL endpoints — I'll be honest with you — I've never seen SPARQL endpoint federation work on the web, because it's just so expensive, it's just so difficult.
With this lightweight interface, it's not a problem at all, which is really cool. Something else that's very cool is that all the software is written in JavaScript, so you can just build live browser applications with it. For instance, here's another SPARQL query. What I'm doing here is using data from VIVO, saying: give me organizations from the VIVO data source, and I want their logo and their Wikidata identifier. So here you see the Wikidata identifier of Brandt University; they have a logo and some other information as well. What I can do now is build an application on top of that. So I'm looking here at the JavaScript application, where I say: start a SPARQL query over those three data sources, retrieve the data, and do something nice with it. In this case, when I click execute, data streams in live and I render the logos, the Wikidata link, and so on and so forth. When I click here, for instance, on the Wikidata link, I'm actually taken to the Butler University page on Wikidata — and all of this is coming from three different data sources at once. This is how easy it is to work with live linked data from multiple sources on the web, right in your browser. And the best thing of all is that it's really affordable: you don't have to run an expensive endpoint; you can have a lightweight interface, as a small library or whatever, and people can browse your data and build applications on the web, and everything keeps working 99.9% of the time, which is great.

The other great thing, as I explained before: yes, TPF makes servers cheaper, and the drawback is that performance is lower. However, as I said, I'm a researcher, and we measure things like performance and bandwidth. What we've seen is that in federated scenarios, TPF can in fact be as good as SPARQL endpoints, and sometimes even faster. Federation is a totally different scenario, and if you want to make it work, I think lightweight interfaces are the way.

So today we've talked about SPARQL and RDF, about why I think lightweight interfaces are very important, and I've shown you a demo of federation. This demo is why I really believe that federation with a TPF interface is a game changer. Because yes, it's possible to have SPARQL endpoints, but on their own they will never cut it on the web. This is why I believe in the combination of SPARQL endpoints and cheaper interfaces for federation. Now, I shouldn't overpromise: federation is not always possible, and some queries will always be hard on the open web. For instance, if I ask something like 'count the number of cat pictures on the web', that query will literally take forever. If you centralize the data — if you have one endpoint with all the data — then you know you'll get the answer; but since the web is open, some queries will always be hard. That's the challenge we accept. And actually, as I've shown, many more queries than you'd think are pretty fast: the queries I showed were quite complex, and still thousands of results arrived within several seconds. Best of all, even though it takes a couple of seconds, the results come in streaming — so if you're building an application, you don't have to wait until all the results have arrived, as you would with a SPARQL endpoint; you can start showing things to the user right away.
And the best thing of the whole story is that all the software you've seen, all the specifications, all the research, is open source. It's all available at linkeddatafragments.org. So if you want to start your own lightweight RDF server, or start doing federated queries yourself, there's no excuse — you can just start. That's the end of my talk. What I'll do now is paste, in the BlueJeans chat window, the links to the queries I've shown, so you can start experimenting with them yourself. Here is the artists query; here is the query for finding books in the Harvard Library by San Francisco authors; here is the query with organizations and VIVO; and the last link is the browser application that shows how you can work with live linked data in applications in your own project. That's it from my side — thanks very much for having me. If there are any questions, please let me know. I won't be here until the end, because it's already 10 p.m. here, so if you have any immediate questions, ask them now; for anything else, you can always send me an email or whatever — I'm easy to reach. Thanks.

Thanks a lot, Ruben, that was awesome. And we do, in fact, have five minutes — we're pretty much on track at the moment. So if you have any questions, I'd be happy to relay them; I think we had a couple on IRC. Eric, I think you have one. Is that solved already?

What I was curious about is that this is obviously executing many more queries — some of your examples were executing 200 or 1,000 queries per answer. I was wondering how that plays out with scaling, because that's one of the things I worry about when I'm developing a system: if it issues 1,000 queries, that's going to be a lot of server resources.

Yes, that's a very good question, and I have an even better answer, if I may say so. I'm going to quickly re-share my screen so you can see some data on this. I'm very happy you asked, because, like I said, I'm a researcher, and it's our job to measure exactly these things. What I have on screen right now is a research paper, which I'll share with you, but let's look at some of the graphs in it. We've measured what happens when you have lots and lots of clients. In the graph you see right here, the top half shows SPARQL endpoint performance and the bottom half shows our client-server setup, and this is what happens as the number of clients increases — notice that the axis is logarithmic. SPARQL endpoint performance drops drastically as more and more clients arrive, whereas our solution starts out slow but at least remains equally slow. We are much better at dealing with high loads. I'll share the publication through the chat so you can take a detailed look at the results.

Awesome. I think Andy also had a question at some point about the VIAF example and the Harvard dataset. Andy, are you in the chat? Can you unmute yourself? I don't know if that was Andy — maybe we can follow up later if you have additional questions; Andy, you can forward them to Ruben. Okay, I think we have time now for maybe a five-minute break before we continue. So stick around for the second part of the workshop. Thanks, everybody.

Okay, I think we're going to start again. Tim and Ben, are you guys there?

Yeah, we're here.

All right, the stage is yours.

Okay — thanks for the opportunity to be here with you today. There were some really nice technical discussions before this.
This talk will be a bit different: it will be less about the technology and how things work, and more about an application of it — several applications, in fact. I think in my part of this I'm only going to do one SPARQL query, but it's a SPARQL query that may save your life someday, so hopefully that's interesting to you. Tim will follow me, and he will get into more of the nitty-gritty technical aspects of this work. As you can see there, we're both coming from Scripps Research. I've been part of this Gene Wiki project for the last several years now, although I wasn't there when it got started — so let me explain what it is. Can you see the slide? Okay.

Before going further, just to make it clear: the Gene Wiki project is a large group of people. I wanted to highlight the people most actively involved right now, particularly Andrew Su, who is our boss here at Scripps and the one who really started this project quite a while ago. Sorry — my screen's freaking out here, right on time. Hold on, technical difficulties over here. Okay. Then there's Andra Waagmeester, who is actually the SPARQL guru on the team and a consultant for our project, and then two postdocs who work here at Scripps: Sebastian, and Tim, whom we'll be hearing from shortly. So all of this is thanks to their work.

Coming back to the Gene Wiki project and its purpose — really, all of our purpose here, doing bioinformatics work at Scripps Research — it is about the organization of knowledge (I said 'curation of knowledge' in the title). Here is an example of knowledge about a specific human gene as it's presented to most of the scientists we work with. This is a query interface at PubMed, the central repository of journal article abstracts for the life sciences. If you query for this particular gene, you'll get many thousands of results. So we have quite a bit of knowledge to organize, and that's our job. And of course this is important: this is the foundation for all the science, all the drugs and so forth that, like I said, may eventually save your life. Yet even today, despite all of our efforts and everyone else's, most of it remains locked up in the text of these journal articles. To give you an idea of the scale of the work within the life sciences: according to PubMed, we publish about two articles every minute — and that's counting only the articles written in English that make it into that repository. That means we're at more than a million per year and growing rapidly. So it's quite a challenge, and from that challenge the Gene Wiki project was born.

The purpose of the Gene Wiki is to take all the information about human genes — at least, that's how it started — coming into repositories like PubMed, and turn it into, basically, a review article for every human gene. So when you want to know about this particular one, Fibronectin, instead of reading 30,000 journal articles, you go to one place and get a synthesis of that knowledge, which then links you back out to the important places where the information lives. That's the goal of the project at a high level. This is something that started, like I said, a while ago, and these are examples of what we refer to as Gene Wiki pages — they live just like any other page on Wikipedia.
What the group has done is add quite a bit of automation to the creation and maintenance of these articles, based on the structured data we have access to from other databases. Gene Wiki articles are created as stubs automatically by our bot. The stub contains a little one-sentence summary and, importantly, all of this information here on the left: the structure of the protein, the various ways the gene is represented in other databases, links to those databases, where the gene is expressed in the body, references for where all this information comes from, and so forth. The idea is that by creating these templates — these basic structures — and keeping the information in them as up to date as we can, we provide a landing place where people can come and get a little bit of information, and then, when they're able, fill in the text that forms the article. That was the basic idea.

To give you an idea of where we came from, where we're going, and where we are: this project started around 2007. Around 2008, the group got approval for and executed the first very large bot run, using a bot called the Protein Box Bot, and created articles for about 9,000 human genes. Where gene articles already existed, it updated them so that they all use that infobox template. A year later, we were approaching 10,000 genes and made a big update to the bot. By 2011 — which is about when I came into the mix — we were over 10,000, and in some of my work as a postdoc we showed that the project was essentially working: in addition to the number of articles growing, the text in the articles was growing, so people were actually doing what we had hoped they would. Bit by bit, we're accumulating that trove of review articles. There was also quite a bit of analysis of the quality of the text: very low vandalism, even compared to the rest of Wikipedia, which itself has fairly low vandalism.

Now, where this starts getting interesting — and it got interesting to me about two years ago — is that we got an NIH grant to support this work going forward, and at the same time Wikidata started to become a useful thing. So we started moving the data we were managing into Wikidata. A big milestone earlier this year: we converted all of our efforts within Wikipedia for human genes, such that the infoboxes you see there — the pictures, everything — draw all of their information from Wikidata. The other exciting thing this year, which Tim will talk about after me, is that beyond Wikipedia, Wikidata is now able to drive other applications; we'll get into that later. A lot of this is tracked, with references, on the Gene Wiki portal page on Wikipedia.

Let me explain, for the wiki folks here, what this means to us — I'm sure you can appreciate it. That template I described looks like that middle panel there, in wikitext: many templates embedded within templates to get it formatted properly and so forth. The way it was done initially meant we had one copy of this for every one of those 10,000-plus articles we maintained. Keeping them up to date involved quite a bit of error-prone parsing and processing. It was, frankly, a terrible way to handle data — but we did it.
All of those 10,000 articles are now represented by this one Infobox gene template. So instead of maintaining the data inside each article's copy of the template, we maintain the data in Wikidata, and we worry about formatting and displaying it in the Lua code that runs the template. That was a huge technical step forward for us, and I think it's fair to say we're one of the more advanced uses of Wikidata on Wikipedia right now. We're big Wikipedia fans, we've been working on this for a long time, but when we look at Wikidata, I think the potential impact goes far beyond what's going to happen within Wikipedia. And that impact will be mediated through the SPARQL endpoint; that's the language other applications are going to speak when they grab data from it.

So, to give you an idea of what we've done and what we're working towards maintaining here: right now within Wikidata we have items for every human, mouse, and in fact every macaque and rat gene that we know about, along with the associated proteins and gene products for all of those organisms. We also have all of the Gene Ontology terms. The Gene Ontology is a reference vocabulary for describing the function of genes in terms of their localization, their molecular function, and the biological processes they operate in; that was about 40,000 terms, and those are all entities now that can be used within Wikidata. We've done the same thing for the Human Disease Ontology, which is just what it sounds like, about 9,000 terms. We've also imported all FDA-approved drugs, and there's been a significant investment of time recently in expanding to other chemicals. And, as Tim will talk about later, we also have more than a hundred reference microbial genomes, which is important for the microbiome research many of you have probably heard a little about. What we hope to be doing here is planting the seeds for a network of knowledge to grow within Wikidata that everybody can share in. Although I'd say that network remains pretty thin for now, we got to the point this year where we can start doing some interesting queries. There's a link there at the top if you want to explore a long, relatively unstructured list of things we've been playing with, but just to give you some basic examples, you can ask things like: where in the cell is the Reelin protein expressed? Is it in the nucleus or in the membrane? Or: what diseases does a particular drug treat? And the one query I want to show you is very simple; it's a query for something we don't know yet: what diseases might be treated by metformin? This is an example of a scientific use case for the graph of data that's actually in Wikidata, queried through SPARQL. So the question is: I have a drug, in this case metformin, what diseases might it treat? We already know it's a diabetes drug, so could we use it for other things? This is a big business right now in the world of drug repurposing. And it turns out there's a pattern, which you can find in the literature, that shows one example of how to find new candidates for repurposing. And that pattern runs like this.
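As a rough sketch, the pattern can be written as a single SPARQL query against the Wikidata Query Service, which has these prefixes built in. Q19484 is metformin, and P129 ('physically interacts with'), P702 ('encoded by'), and P2293 ('genetic association') are the standard Wikidata properties described in the next paragraph; the exact query on the slide may differ:

    SELECT ?gene ?geneLabel ?disease ?diseaseLabel WHERE {
      wd:Q19484 wdt:P129  ?protein .  # metformin physically interacts with a protein
      ?protein  wdt:P702  ?gene .     # that protein is encoded by a gene
      ?gene     wdt:P2293 ?disease .  # that gene has a genetic association with a disease
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }

Pasted into query.wikidata.org, this returns candidate diseases along with the gene that links metformin to each of them.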
So: if the drug has a physical interaction with a protein, and that protein is encoded by a gene that has a genetic association with a disease, it turns out the drug may be related to that disease and may have an impact on it. And we can ask that question with this SPARQL query. I'm looking for genes and diseases here to fill in the pieces of this graph, and I'm saying: Q19484 is metformin in Wikidata, and P129 is physical interaction. So metformin interacts with this protein, this protein is encoded by this gene, and this gene has a genetic association with this disease. Execute that query, which you can do yourself by clicking the TinyURL there in the slides I shared in the chat, and you'll get a list of diseases back, along with the gene that linked metformin to each of them. This turned out to be a nice example, because people are actually researching the use of metformin for prostate cancer in the recent literature, so it validates that there are some interesting things to be discovered here. Now, this is a pretty simple query over what, I'd reiterate, is still a fairly light representation of the knowledge in the world, but it's a nice starting point and, I hope, a good example of how we can use this for science. And with that, I want to transition over to Tim; he's right here at the same desk, so he can just jump on the computer and continue. One second.

Okay, hello everybody. Thanks for the opportunity to talk today; it's really great to get an overview of other people's work with SPARQL and Wikidata. I'm a molecular biologist by training and a postdoc here in Andrew Su's group, working with Ben on loading bacterial genomes into Wikidata alongside the human, mouse, and other mammal species we've loaded. The microbiome is a really hot topic, and a linked data model is a great way to explore it, so this is where my focus is. Ben already did all of our acknowledgments, so I'll reuse his slide for that, just adding Ben himself, who you just heard, and jump right in. Real quick on the data I've loaded into Wikidata and its structure, which will make the SPARQL queries make a little more sense once you can see the overall shape of the data. Before I started, there were already bacterial lineages in Wikidata, where a species' parent taxon is the genus, and so on up through the ranks all the way to the domain, which is Bacteria. What I came here to do is add genomes, their genes, and their proteins for bacteria to Wikidata, to create that stub, like Ben mentioned for the Gene Wiki project, that people can extend with semantic relationships and so on: interesting things like the drugs that treat these organisms and the diseases they cause. We really want to create this structured data model. So I created an item for the genome that was sequenced, which is essentially a strain of that species, and then linked all the proteins and genes to it through the 'found in taxon' property. The hierarchy is shown here: you have the item for Chlamydia trachomatis, then the strain, which also represents the genome that was sequenced. Through the 'parent taxon' property you ascend the hierarchy, and if you kept going you'd get to Bacteria. Going in the other direction, the 'found in taxon' property links the genes, as well as the proteins, to that genome.
And those are linked to each other through the 'encodes' and 'encoded by' properties. So it's really a stable structure that can be built on. Our whole purpose is to build a data model that will aid biological research, and also to provide a platform where basic researchers can consume the data in a way that has context for their work and makes sense to them, rather than going into Wikidata's interface and clicking through things; they need something a little more intuitive for their work. We also want them to contribute information, edits and statements and so on, and we'd like to provide a way for them to push those back to Wikidata as well. So we're making a web application called the centralized model organism database. When you load the application, you see a form that says "start typing the name of an organism to continue." Right now, since I'm the bacteria guy and I'm building this, it only contains bacteria; we'll include more later, which requires tweaking the model a little. What you do is start typing, and when the page was loaded, this query was executed through an AJAX GET request. What the query does is select, essentially, any organism that has parent taxon Bacteria; the asterisk means you recursively follow the 'parent taxon' property back through the model all the way up to Bacteria. So essentially it says: get me any bacteria items in Wikidata. Then we narrow it down by saying: select only the ones that have an NCBI taxonomy ID, and narrow it further to those whose genome has been sequenced, by requiring a genome RefSeq ID from NCBI. These items have all been given these core identifiers, and then of course there's the label service to get the labels. So we get a list of essentially all the bacteria in Wikidata that have had their genome sequenced, and when you start typing, you get a dropdown list of options where you can see the identifiers involved. You click on one, the page is redirected, and it executes another SPARQL query. This second query uses the taxonomy ID we just got from selecting that organism to fetch all the genes and proteins, or at least the identifiers for all the genes and proteins, in that genome. So: give me the organism based on the taxonomy ID, then select gene items that are found in that organism (P703 is 'found in taxon'), and then a list of identifiers and annotations: the Entrez Gene ID, the genomic start and stop positions, and the locus tag, which is another identifier. So you're getting basic data about the genes. There's also an optional block that says: if there's a protein the gene encodes, I'd like that information as well. So you're getting a large amount of information for the bacterium when you select it, and then it redirects to this page, which actually uses that data. Real quick: the organism data from the first query is displayed here; this is a place where we, or a user, can add more information about the organism. Down here is where the gene information is displayed: identifiers, the Wikidata ID, links out to the databases they come from, and information about the gene annotation. We've also built in an open-source genome browser called JBrowse, which visualizes where the genes lie on the genome, and it's novel in that it runs off the Wikidata SPARQL endpoint to get its annotations.
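The two queries just described might look roughly like this. This is a sketch, not the application's exact code; Q10876 is the Wikidata item for bacteria, and P685, P2249, P351, P644, P645, P2393, and P688 should be the NCBI taxonomy ID, RefSeq genome ID, Entrez Gene ID, genomic start, genomic end, NCBI locus tag, and 'encodes' properties, but are worth double-checking against the live data. First, the autocomplete query for sequenced bacteria:

    SELECT ?organism ?organismLabel ?taxid ?refseq WHERE {
      ?organism wdt:P171* wd:Q10876 .  # anything whose parent-taxon chain reaches bacteria
      ?organism wdt:P685 ?taxid .      # ...that has an NCBI taxonomy ID
      ?organism wdt:P2249 ?refseq .    # ...and a RefSeq genome ID, i.e. a sequenced genome
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }

Then, once an organism is chosen, something like this pulls its genes and, optionally, the proteins they encode (the taxonomy ID below is a made-up example):

    SELECT ?gene ?entrez ?start ?end ?locusTag ?protein WHERE {
      ?organism wdt:P685 "471472" .             # hypothetical NCBI taxonomy ID of the chosen strain
      ?gene wdt:P703 ?organism .                # genes found in that taxon
      ?gene wdt:P351 ?entrez .                  # Entrez Gene ID
      OPTIONAL { ?gene wdt:P644 ?start . }      # genomic start position
      OPTIONAL { ?gene wdt:P645 ?end . }        # genomic end position
      OPTIONAL { ?gene wdt:P2393 ?locusTag . }  # NCBI locus tag
      OPTIONAL { ?gene wdt:P688 ?protein . }    # protein the gene encodes, if any
    }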
And then the information about the protein that gene encodes comes up in a box down here. Now, we're not really interested in this particular protein; it's just auto-loaded when the page loads. So we have another search box up here where you can search for any of the genes we gathered with the last SPARQL query, and it gives you another dropdown box just like the one you saw before, this time for gene information. What happens when you select a gene, to populate this protein box down here, is that a few queries are made for that gene. It takes, excuse me, the UniProt ID; the first query is: find a protein item in Wikidata that has this UniProt ID. And then, because references are so important to what we do, we need to very clearly cite where this information came from for the scientific community to care about it. So we use the p: prefix to get information about the statement itself. On that protein, select the statement that says it has this InterPro domain; that statement will have qualifiers and references attached to it, and this is how we get at that kind of information. On that statement I'd like the value: ps: gives the simple value, which is the Wikidata item for the InterPro domain. It's another item we've created in Wikidata that has biological relevance, this protein has it as a property, and we link to it. Now that I've got the statement, I'd like to get all the references, so we can display them in our application as well. For that we use prov:wasDerivedFrom, and then you can ask for the different reference properties you'd like returned for the statement. This gets you the reference: stated in (what database was it stated in?), the publication date, the software version, and a reference URL, for example for this InterPro one, and then give me the InterPro ID as well. It's a pretty complicated query, but essentially it returns this information, which is then rendered in the protein box on the web page. So here is the InterPro item's label and the InterPro ID, and if you click on one of the reference buttons you get the reference that's in Wikidata for this item: where we got it, namely that this information came from the reputable database InterPro, taken on this publication date, with this software version and reference URL, as I mentioned before. I'll show what that looks like in Wikidata for another statement in a minute. Now, what we really want to do is give basic researchers a way to add information to Wikidata within our data model without breaking the model. We've put a lot of time and research into developing a very stable model, and it would be really hard to train people to edit it properly with just the Wikidata interface. So we've developed some web forms here that allow people to make very well-defined statements with a few clicks. If they find a piece of information in a PubMed article, like a molecular function that a gene's protein has as an attribute, they can add that information themselves, if it's not already there, and cite the article they took it from. So they click on the 'add molecular function' button; or essentially, excuse me, before that they have to log into Wikidata, and eventually this will use OAuth.
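A rough sketch of that statement-and-reference pattern follows. The talk doesn't name the property that links a protein to its InterPro domain item, so Pxxx below is a placeholder you'd have to fill in before this runs, and the UniProt accession is a made-up example; P352 should be the UniProt protein ID, P2926 the InterPro ID, and pr:P248, pr:P577, pr:P348, and pr:P854 the 'stated in', publication date, software version, and reference URL properties on the reference node:

    SELECT ?domain ?domainLabel ?interproID ?statedIn ?pubDate ?version ?refURL WHERE {
      ?protein wdt:P352 "Q46941" .            # look the protein up by UniProt ID (example value)
      ?protein p:Pxxx ?statement .            # the full statement, not just its value
      ?statement ps:Pxxx ?domain .            # ps: gives the simple value: the InterPro domain item
      ?domain wdt:P2926 ?interproID .         # the InterPro ID on that domain item
      ?statement prov:wasDerivedFrom ?ref .   # the reference node hanging off the statement
      ?ref pr:P248 ?statedIn .                # stated in: which database this came from
      OPTIONAL { ?ref pr:P577 ?pubDate . }    # publication date of that database release
      OPTIONAL { ?ref pr:P348 ?version . }    # software version
      OPTIONAL { ?ref pr:P854 ?refURL . }     # reference URL
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }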
Right now I've just created a form that uses my credentials, but eventually, when it's live, people will be able to use OAuth so that Wikidata authorizes the application to make edits. So you log in, you're logged into Wikidata, and you have this form. SPARQL queries will find for you the Gene Ontology term that represents the function you're trying to add to this protein: you start typing its name, you get the function, you click on it, and it finds the QID and places it there. Then we'd like to add a qualifier: what is the evidence code? Was it inferred from an experiment, or from a sequence alignment, or something like that? You start typing the name of that and you get the evidence code you're interested in. Then you put the PMID in, it finds the right paper from NCBI, you click on it, you confirm that this is the PubMed identifier you'd like as part of the statement you're going to add, and then you just hit 'edit Wikidata'. It takes a few seconds and tells you that you've successfully edited Wikidata, though it does take a few minutes for the SPARQL endpoint to be updated, so the edit appears in the application when we refresh. You can also go right into Wikidata, and this is what that kind of statement looks like once someone has added it: molecular function is the predicate, and the value is protein kinase binding. The determination method is the qualifier we added, EXP, which means it was inferred from an experiment. And then there's a very rich reference saying that it's stated in a scientific paper, the work is in English, here's the identifier for that reference, which is the PubMed ID, and we've added an 'imported from' reference pointing to the centralized model organism database, so we know the actual annotation was made from the portal we created, and on what day. So this really is a way in for someone who doesn't have experience with SPARQL. Essentially, basic biological researchers aren't likely to learn SPARQL to navigate the graph themselves, and they aren't likely to go into Wikidata, understand the data model, and add properly referenced data to it. So we really need to provide tools that let people do that, and to define the edits we most need help with, so we can crowdsource the annotation of all the information out there that's buried in text and turn it into structured data. And with that, I thank you for your time, and I don't know if we have any questions right now.

Fantastic, thank you folks. That officially ends the presentation section, and I think a few of our speakers are going to stick around. Lucas is also here, our ultimate chief SPARQLer, who can help us solve our questions. So first I want to ask around whether there are specific questions on this talk, or more broadly on anything we've seen before; beyond that, we can just go over examples. There's no agenda for the remaining part of the meeting, and it's not going to be recorded, or rather, the recording will be trimmed before the video is published.