So, as Frédérique mentioned, I'm going to talk mainly about one specific tool, a substructure search. Oh, sorry, my cursor isn't working. Okay, so I'm going to talk about what it is, why we needed it, the solution we came up with, how we formulated the solution, and how it's integrated with the rest of our resources, and then a few examples and a live demo, and hopefully questions at the end. What I'd ask as well, as I go through the presentation, is that if you think you have an analytical question that might be solved using these tools, I'd really appreciate it if at the end you'd ask how we would solve it, because we have our own idea of what the questions are, but obviously we want to tailor the user interface so that it's useful for experimentalists.
Okay, so Frédérique did a really good job of presenting GlyConnect and all the tools we have at Expasy. GlyConnect is based around glycoproteins. Here I show that we have glycan structures, a number of different types of glycan structures, and we have varying degrees of information about those structures. These structures are linked to proteins in some way. Going back to the very relevant, or what used to be very relevant, example of COVID, there are two proteins listed here: the human angiotensin-converting enzyme, which is the receptor, and the spike glycoprotein, which is found on the virus. So we have a glycan ID in the form of either the GlyConnect ID or a GlyTouCan ID, it is linked to a protein which has a UniProt ID, and then of course these are related to a disease, in this case SARS-CoV-2, which also has a disease ID, a DOID, a Disease Ontology ID.
So if this is the story that you're looking for, if you're looking to see what glycans are found on the proteins that are associated with a certain disease, that's quite doable on GlyConnect. It's very easy to go to a page and pick all of these things, and it's presented in a very nice, user-friendly way. It's not loads of lists, it's very graphical. However, the question you're asking might be a little bit more complicated, for example: are there any Lewis A type structures associated with COVID? As Frédérique pointed out, if the Lewis A type structure is well defined, that's not really much of an issue. It becomes an issue when you have something like this, which is undefined. It could possibly be a Lewis A, but we don't know where that fucose is, so we can't say whether it is or is not.
So we have GlyConnect structure data that's linked to proteins, linked to tissues, linked to diseases, linked to sources and to publications, and with all of these there are some sort of IDs that allow you to navigate the data. To turn the question around, you could also ask what diseases are associated with a Lewis A epitope, for example. Across these two questions, the thing they have in common is that we're looking for a substructure, a small piece of the structure that matches.
So, a substructure search. A few years ago, Frédérique and her group looked into how they could do this with GlyConnect, and they investigated a number of different approaches, and one of the ones that they tried to implement was using RDF and graph depictions of GlyConnect. The issue is that with the experimentally derived data that we have in GlyConnect, there are a number of different levels of resolution. If we have NMR, and we do have NMR structures, they have a high level of knowledge associated with them: we have all of the linkages, we have the anomeric connections, we have the type of isomer, whether it's galactose or glucose, for example.
The problem with NMR experiments is that you need a lot more sample, and so the benchmark is mass spec, and there are levels of detail that are sometimes missing with mass spec. And then you go down to glycan arrays, you have digestions, you have chromatography, and these all have different levels of structural annotation associated with them. So how do we address this? If at the end of the day what we want is a substructure search, we have to be able to address, in some way, the blanks that are there in the glycan structures.
Just to give you an overview, and I'm sure you're all used to this, the image on the right-hand side is the SNFG representation of glycans. The square means something, the color means something, and you can see some of the linkages are actually annotated there. Behind that cartoon is what you see on the left-hand side, which is a GlycoCT string. The one on the right is human readable, and immediately we can tell what we're looking at. The one on the left is computer readable, so for a computer this makes more sense, but it is also a fairly human-readable version of the text. If you can see my screen here, here we have a residue list, which is a list of all the saccharides and any substituents that they include, and here we have the linkage information. This information here links all of these, so this is what's known as a connection table. If you'd like to read a bit more about that, there's a paper. GlycoCT has been around a long time, it was brought out in 2008, so it's become a kind of standard for depicting glycan structure in a text format. However, the drawback here is that it's not easy to search for small structures within this, because it is a text format.
Okay, so I'm going to segue a little bit now and talk about the semantic web. Frédérique mentioned this earlier.
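To make the residue list and connection table a bit more concrete, here is a minimal Python sketch that splits a condensed GlycoCT string into its RES and LIN sections. The example string (a galactose linked to a GlcNAc) and the function name `parse_glycoct` are illustrative assumptions, not taken from the slides, and the parser only covers the simple case without repeat or undetermined sections.

```python
import re

# Illustrative condensed GlycoCT string for Gal(b1-4)GlcNAc; an assumption, not from the talk.
GLYCOCT = """RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dgal-HEX-1:5
LIN
1:1d(2+1)2n
2:1o(4+1)3d"""

# A LIN line reads: index : parent residue + link type ( parent position + child position ) child residue + link type
LIN_PATTERN = re.compile(r"^(\d+):(\d+)([a-z])\(([-\d|]+)\+([-\d|]+)\)(\d+)([a-z])$")

def parse_glycoct(text):
    """Split a condensed GlycoCT string into a residue dict and a linkage list."""
    residues, linkages, section = {}, [], None
    for line in text.strip().splitlines():
        line = line.strip()
        if line in ("RES", "LIN"):
            section = line
        elif section == "RES":
            ident, descriptor = line.split(":", 1)
            # ident is e.g. "1b": residue number 1, kind b (basetype) or s (substituent)
            residues[int(ident[:-1])] = (ident[-1], descriptor)
        elif section == "LIN":
            m = LIN_PATTERN.match(line)
            if m:
                _, parent, _, parent_pos, child_pos, child, _ = m.groups()
                linkages.append((int(parent), parent_pos, child_pos, int(child)))
    return residues, linkages

residues, linkages = parse_glycoct(GLYCOCT)
```

Even in this small case, finding "a GlcNAc with a Gal on it" means walking both lists and joining them by residue number, which is the difficulty with searching the text format directly.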
So the semantic web is kind of like imagining the entire internet is linked in some way. The internet in and of itself isn't much use unless there are some links between data resources. The semantic web describes a common framework that allows data to be shared and reused across different applications and different domains. The technologies that are associated with the semantic web are ontologies, and RDF, which stands for Resource Description Framework. It turns out that there are already a number of ontologies available in glycoinformatics. Because of this, it was a good path to follow in terms of trying to depict glycan structures: if there are already ontologies available for other parts of glycoinformatics, surely if we develop an ontology for glycan structures, we can link all of these parts.
So just a few definitions. An ontology is a series of concepts and categories in a subject area or domain that shows the properties and relationships between them. Basically, an ontology is a vocabulary, or a language if you like, and it's the model that we're going to use to model our data. A knowledge graph then is specific instances of this model. So if we're talking about glycan structure, a knowledge graph contains instances of glycans described using the ontology. The Resource Description Framework, or RDF, is a standard for actually describing the resources and how we exchange data between resources. And then the last thing is SPARQL. I'm going to mention SPARQL quite a lot, actually. SPARQL is a standard query language for any kind of linked open data and RDF databases, not just what we have developed here. So it's a method to query our data.
This is very similar to the picture that Frédérique had shown you. It's a nice overview. I mean, obviously, I don't have the GAGs here.
I'm limiting it again to glycoproteins, but it's a nice snapshot of how glycobiology spans different domains and different resources. We have some kind of tissue or cell membrane, and embedded within that is some glycoprotein. The glycoprotein has experimentally derived peptides and sites, and attached to this site is a glycan. The glycan obviously has some kind of composition. And as well, on the glycan there is the possibility of some kind of epitope, or in this case it's called a ligand or a determinant, that is recognized by a glycan-binding protein such as a lectin.
Okay. So as I mentioned, there are already a number of ontologies available in glycoinformatics. The first one I'll talk about, the one on the right, is the GlycoConjugate ontology. GlyConnect has incorporated the GlycoConjugate ontology to describe the publications that we have, the proteins, and the diseases that may be associated with them. So we have actually already implemented this ontology in GlyConnect itself. Then on the left, you see the GlycoRDF ontology, which is more concerned with experimental details such as the peptide, the amino acid sequence of the peptide, and perhaps the composition of the glycan that's been solved in association with it. There's also an ontology for SugarBind, which is a binding motif database that's also available on Expasy. What's missing from here is an ontology to describe the glycan structure, which we have called GlySTreeM. And I just want to draw your attention to the lectins: there's actually an ongoing project to develop an ontology for the lectins as well, so that's hopefully for next year. But what I'm going to talk about is the GlySTreeM ontology to describe glycan structures.
So, back to why this is important. Think about something like: how would I look for a Gal-sialic acid epitope in a structure?
If I look just at the GlycoCT format, which I have here, you need to parse the text in some kind of meaningful way. You need to know where the residues are in relation to the rest of the structure. It's not the easiest thing to do. Sorry that the image is not very good quality here. If I go back to the COVID-19 example that I started with, here is one of the structures that was available with an unknown section. If I just show you the residues, the first two lines of the residue list point to the GlcNAc, the first core GlcNAc that's on the structure. And the linkage information is here, the first line in the linkage text. So again, how would I search in this GlycoCT for a Lewis A-like structure? It would be quite difficult, because you'd have to identify that there was a GlcNAc somewhere with a Gal linked to it, that this was part of a larger structure, and that there was a fucose. The fucose in this depiction is unknown. What that means is that this fucose is definitely part of this structure, but we don't know where it is. In GlycoCT that is expressed by this undefined section here, which is very difficult to parse.
So we developed an ontology, or a model, that describes a glycan structure and that makes it easier to search for patterns within that structure. This is a graphic of the model. You don't really need to worry about it so much. There are just two things I'd like to point out. The first is that the entire glycan structure is described by a glycan object. All of this information is contained in the model, and at the top of the model, or the tree if you like, is the glycan itself. And the second part, which makes it very, very powerful, is that the undefined section is also held in its own object, a glycan bag in this case. So what that means is we now have three representations for a glycan.
We have our SNFG, which is the visual cartoon. We have our GlycoCT, which underlies this cartoon and also underlies the model. And we have the RDF model that we have here. So these glycan structures are now represented as graphs. Imagine that the residues are nodes of the graph, and the relationships between the residues are edges, and in those edges are held things like, for example, "has a child". So the core GlcNAc has a child that is a fucose, for example. The anomericity of the linkage is held as information in that edge. The carbon that's involved in the linkage is held in that edge as well. We can then create patterns that we want to look for in the glycan structure, and we do that using the SPARQL language, which I mentioned earlier.
We then looked at how we could apply this solution, first of all in-house, to our own database, GlyConnect. But why was this useful for us? So one of the first things we realized was that, oh, sorry, first of all, how we actually formulated it. We took this glycan, for example. This is the GlycoCT that's associated with it. We generated what are called triples. This is basically another text file, but a computer-readable text file that describes every single relationship within this glycan in terms of the model that we generated. And then this was stored in what's called a triple store. In our case, we used GraphDB. And then we can query this graph, or knowledge graph, using these SPARQL queries. A SPARQL query, once it's executed, will give you a list. In this case, these were glycan IDs. And these IDs link back to GlyConnect, where you have a picture of your glycan, the protein associated with it, the source it came from, the disease if there is one, and so on. So again: we extracted the GlycoCT strings, we transformed them into GlySTreeM individuals, we output the data and stored it in a triple store, and then we were able to query the endpoint.
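As a rough illustration of the idea of triples and pattern queries, the following Python sketch stores a couple of toy glycans as (subject, predicate, object) tuples and matches a small graph pattern against them, which is conceptually what a SPARQL basic graph pattern does. The predicate names (`hasRoot`, `hasChild`, `monosaccharide`) and the IDs are hypothetical and are not the actual GlySTreeM vocabulary; a real setup would use a triple store such as GraphDB rather than a Python list.

```python
# Toy triple store. Predicate names and IDs are hypothetical, not the GlySTreeM vocabulary.
triples = [
    ("glycan:1", "hasRoot", "res:1"),
    ("res:1", "monosaccharide", "GlcNAc"),
    ("res:1", "hasChild", "res:2"),
    ("res:2", "monosaccharide", "Fuc"),
    ("res:1", "hasChild", "res:3"),
    ("res:3", "monosaccharide", "GlcNAc"),
    ("glycan:2", "hasRoot", "res:4"),
    ("res:4", "monosaccharide", "GlcNAc"),
    ("res:4", "hasChild", "res:5"),
    ("res:5", "monosaccharide", "GlcNAc"),
]

def match(pattern, triple, bindings):
    """Unify one pattern with one triple; terms starting with '?' are variables."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Bind the variable, or reject if it is already bound to something else.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new

def query(patterns, store):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for t in store
                     if (b := match(pattern, t, s)) is not None]
    return solutions

# Pattern: a glycan whose root GlcNAc has a fucose child (a core-fucose-like motif).
core_fucose = [
    ("?glycan", "hasRoot", "?root"),
    ("?root", "monosaccharide", "GlcNAc"),
    ("?root", "hasChild", "?fuc"),
    ("?fuc", "monosaccharide", "Fuc"),
]
hits = sorted({s["?glycan"] for s in query(core_fucose, triples)})
```

The result is a list of glycan IDs, in the same spirit as the query results described in the talk; edge attributes such as anomericity or the linkage carbon could be added as further triples.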
One thing I want to make absolutely clear is that GlySTreeM is the name that we're giving to our ontology that describes glycan structure. There is an ontology underneath GlyConnect, which covers all of the data: the protein and the disease associated with it, the tissue where it's found. And then GlySTreeM complements that by describing our glycan structure. The reason this is important is that the linked data between GlyConnect and GlySTreeM allows us to do what are called federated queries, which means that with one pass I can pull out, for example, glycans that contain a certain epitope and are linked to a certain disease.
But also, first and foremost, this was useful for validating the data that we had in GlyConnect. One of the first things we did was generate all of the compositions from all of the glycans that we held in GlySTreeM, and we were able to cross-reference those with the compositions that are held in GlyConnect. Now this might not seem so important, but with over 5,000 structures, we couldn't have done it manually. We had also imported certain structures from publications where they provided GlyTouCan IDs, for example, which made our life as curators a lot easier. However, we did need to check, and GlySTreeM allowed us to do that. We were able to pull the compositions straight from GlyConnect, generate the compositions from GlySTreeM, and compare the two. Similarly, we were able to do the same for glycan types, for example. So this is an N-linked glycan. We also had O-linked glycans. We were able to generate a query that mimicked the pattern of an N-linked structure, with two GlcNAcs and a three-mannose core, search for the glycan IDs that contain this, and ensure that they were labeled as N-linked. And again with the cores. So I'll just show you a little bit about that.
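The composition cross-check described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the glycan IDs, residue names, and the two dictionaries are made up, and in practice one side would come from a query on the structure graph and the other from the curated database records.

```python
from collections import Counter

# Hypothetical data: residues derived from each structure graph ...
structures = {
    "G1": ["GlcNAc", "GlcNAc", "Man", "Man", "Man", "Fuc"],
    "G2": ["GlcNAc", "GlcNAc", "Man", "Man", "Man"],
}
# ... and the curated composition stored for the same (made-up) IDs.
curated = {
    "G1": {"GlcNAc": 2, "Man": 3, "Fuc": 1},
    "G2": {"GlcNAc": 2, "Man": 3, "Fuc": 1},  # deliberately inconsistent entry
}

def composition(residues):
    """Count how many times each residue type occurs in a structure."""
    return dict(Counter(residues))

def find_mismatches(structures, curated):
    """Return IDs whose structure-derived composition differs from the curated one."""
    return sorted(gid for gid, residues in structures.items()
                  if composition(residues) != curated.get(gid))

mismatches = find_mismatches(structures, curated)
```

Entries flagged this way would then be candidates for manual review by a curator.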
So the Octopus is a tool that we have on GlyConnect that allows you to look at, for example, N-linked, O-linked or free oligosaccharides, different core types, different properties and different determinants. We were able to validate all of the information and the mapping that we have within that tool using GlySTreeM. Ordinarily this would have been impossible using just GlycoCT. So this is the properties. Sorry, this is the cores. For N-linked, we have the high-mannose core, complex core, hybrid core. We also have these properties, both structural and compositional. For example, a fucosylated structure is easy enough to assign using just composition. However, with core fucosylation, you need to have some kind of idea of the structure. So these core and non-core fucosylation structures were possible to assign using GlySTreeM. And then we were also able to update our list of epitopes. Again, for each of these epitopes, determinants, or ligands, whatever you'd like to call them, we were able to generate a list of glycan structures held in GlyConnect that contain one or more of these epitopes.
So that was one very big use that we found in-house for GlySTreeM. However, a use outside of our group is federated queries. Federated queries are queries that link different resources, for example GlySTreeM and GlyConnect, or GlyConnect and UniProt. They allow linking of different services. For example, with GlySTreeM and GlyConnect, you could look at glycan structures on a certain protein. Or with GlyConnect and UniProt, you could look at the glycoprotein and then delve a little bit more into protein sequence information, for example, what other PTMs are annotated on this protein besides glycosylation. Examples of some of these queries are on our GlySTreeM wiki page.
I'm just going to talk about one in particular, and I'll do a little bit of a demo afterwards as well if we have time. One of the questions we had was: show me the glycan structures and the available N-glycosylation sites that have been annotated for the spike protein of SARS-CoV-2. This is a federated query, and there is an example of it here on the RDF page of GlyConnect. There are a number of different ways of running these queries. I will run through this way of running queries, because that's what's available to users at the moment for more complicated queries. We also have a user interface for simpler queries. But as I said in the beginning, if you have a particular question that you think might benefit from this kind of approach, please feel free to contact us, because we're very much interested in real-life applications of this technology.
So if you navigate to our RDF page, there are a number of different types of queries. These particular ones here are for GlyConnect itself. They're run in a terminal. You enter this command here with the particular query that you're interested in. I have the query up on top here, and it gives you these results down here, so you have a list of structures. This is useful enough. However, with GlySTreeM and using GraphDB, we were able to generate a more useful federated query. This particular one looks at COVID proteins: sorry, sites on the COVID spike protein and what glycans are available for each of those sites. This is an example of a federated query because here, at this section, it actually talks to GlyConnect, so it's trying to get protein information and disease information. And this part here pulls from GlySTreeM, which is our ontology for glycan structures.
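For readers who have not seen a federated query, the following Python sketch only assembles the text of one, to show the general shape: SPARQL 1.1 uses the `SERVICE` keyword to send part of a pattern to a remote endpoint and joins the result with patterns evaluated locally. Everything specific here is an assumption for illustration: the `ex:` vocabulary, the predicate names, the protein label, and the endpoint URL are placeholders and are not the actual GlyConnect or GlySTreeM terms or addresses.

```python
def build_federated_query(protein_label, remote_endpoint):
    """Return the text of a SPARQL 1.1 query that joins a remote endpoint with the local store."""
    return f"""PREFIX ex: <https://example.org/vocab#>
SELECT ?site ?glycan WHERE {{
  # Remote part: protein, site and glycan annotations (placeholder vocabulary)
  SERVICE <{remote_endpoint}> {{
    ?protein ex:label "{protein_label}" ;
             ex:hasSite ?site .
    ?site ex:hasGlycan ?glycan .
  }}
  # Local part: a structural pattern, here a root GlcNAc carrying a fucose
  ?glycan ex:hasRoot ?root .
  ?root ex:hasChild ?fuc .
  ?fuc ex:monosaccharide "Fuc" .
}}"""

# Placeholder endpoint; the real endpoint address is not given in the talk.
query = build_federated_query("Spike glycoprotein", "https://example.org/glyconnect/sparql")
```

The resulting string could be pasted into a SPARQL client; the working queries are the ones published on the project's RDF and wiki pages.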
What this did was allow us to have a list of glycans that were relevant for each site. That in and of itself is maybe not so interesting, but what we were able to do then was actually group the glycans, again using queries in GlySTreeM, and describe the sites using these patterns. I'll show that once I'm finished here with the talk.
The SPARQL language is not amazingly difficult. However, it would be a bit too much to expect users to actually try and learn this language. So the first thing we've done is give examples of SPARQL queries that are possibly useful for experimentalists. As I showed there, there's one set of queries available at glyconnect.expasy.org forward slash RDF. These queries are for GlyConnect itself. The second set are available at forward slash GlySTreeM wiki. These are SPARQL queries that query the GlySTreeM triple store, which is the glycan structure ontology.
However, we also have a simple user interface that's available at this link here, and this is what it looks like at the moment. It's a very basic substructure search. As Frédérique mentioned earlier, you do need to have a GlycoCT string. You enter your GlycoCT string here and press search. And there are two flags, two options. They're very simple options, but fairly powerful at the same time. The first one is whether the substructure is located at the root. What you're saying there is that I want this to be, for example, an N-linked core, so I want it actually at the reducing end of the glycan. The second one is for when you have a structure and you want only that structure. So I don't want it to be a pattern within a structure, I want to look for exactly this structure. And then there's a search button. To use the interface, you need to generate a valid GlycoCT using a drawer.
There are standalone tools such as GlycoWorkbench, GRITS or GlycanBuilder, which all allow you to export a structure like this in GlycoCT. There are also some web-based tools, such as the builder at SugarBind or a glycan drawing web tool, and then you export in GlycoCT format, such as this, for example, if I was looking for this structure. And just to point out, because I don't think I made it very clear: you don't need to have all of the linkage information assigned. Unless you are specifically looking for that, you can just leave the information out and it'll still search for that pattern. So you enter the GlycoCT here. For example, we want to look for this, a bisecting, core-fucosylated N-linked core. You click that you want it located at the root and you press the search button.
Okay. What I'm going to do now is show you a few examples of this in use. For example, you could look for, as I showed you there, a bisecting GlcNAc with a core fucose, or you could search for all structures containing polysialic acid or containing polylactosamine. And again, I want to reiterate: if you have any data analysis that you think might benefit from this kind of approach, please feel free to contact us.
Okay. So I am just going to move this over here. There we go. Now, the first thing I want to show you is how to generate your GlycoCT. I'm using the SugarBind builder. I thought I had another one up there as well. Anyway, they're all very, very similar, or you can use a standalone tool. So I draw my structure, and I'm not actually going to put any linkages there. This is the structure I'm going to search for. And if I come here to File and Export, I can export GlycoCT, and it'll ask me to save the file. Okay, I'm going to leave that for the time being. So then once I have my GlycoCT string ready, I can actually, wait a minute now, which one is it?
Okay, we'll actually try. So I have a polylactosamine as an internal substructure. I don't click "substructure located at root", and I don't click "match with exact number of residues", because I want to find it as a substructure of a larger structure. And when I click search, I get 431 results that contain this structure. I just want to point out that when you enter your GlycoCT, not only are you given the results, but you're also given a cartoon of the search string that you entered, to make sure that you're actually searching for the thing that you thought you were searching for. So you can see all the different structures that contain this polylactosamine.
Okay, we'll just clear this. And this is the GlycoCT for, as far as I remember, the bisecting core. So I'm actually going to say that I want this to be located at the root, and I search. Here you see it's generated the cartoon to show me what I was searching for. And then it shows me that I have 137 results that have this pattern included in the structure.
And then the last one, which is a really nice one, I think. Again, I don't want this located at the root. This is a polysialic acid, so a sialic acid linked to a sialic acid, which is quite unusual. And we have 15 examples of that in GlyConnect. What you see in the results here is the cartoon of each of the structures. We also have an ID. If we click on this ID, we're brought to the page for that particular structure, where we can find information about the protein, the disease, if it's involved in a disease, and the reference in which this was found. There's composition information here, a GlyTouCan ID for the structure, the GlycoCT, the IUPAC, and so on.
So I have one more. Sorry now, I'll just get my, here we go. So we think, hopefully, that this is useful for people if they want to look for substructures in the database.
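The behaviour of the two options in the demo can be approximated with a small tree-matching sketch in Python. This is not the tool's implementation (which runs SPARQL queries over the triple store); it is a simplified illustration under assumptions: glycans are nested `(residue, children)` tuples, linkage positions are ignored (in line with leaving linkages unassigned in the search), and the function names and example structures are invented for the example.

```python
def embeds(pattern, target, exact=False):
    """True if the pattern, anchored at this target residue, maps onto it residue by residue."""
    p_name, p_children = pattern
    t_name, t_children = target
    if p_name != t_name:
        return False
    if exact and len(p_children) != len(t_children):
        return False  # exact mode: no extra branches allowed
    return assign(p_children, t_children, exact)

def assign(p_children, t_children, exact):
    """Backtracking: give every pattern child its own distinct target child."""
    if not p_children:
        return True
    first, rest = p_children[0], p_children[1:]
    for i, candidate in enumerate(t_children):
        if embeds(first, candidate, exact):
            if assign(rest, t_children[:i] + t_children[i + 1:], exact):
                return True
    return False

def search(pattern, glycan, at_root=False, exact=False):
    """Substructure search with the two options: located at root, exact match."""
    if exact or at_root:
        return embeds(pattern, glycan, exact)
    if embeds(pattern, glycan):
        return True
    return any(search(pattern, child) for child in glycan[1])

# Made-up example structures, written from the reducing end (root) outwards.
lacnac = ("GlcNAc", [("Gal", [])])
core_fuc = ("GlcNAc", [("Fuc", []), ("GlcNAc", [("Man", [])])])
glycan = ("GlcNAc", [
    ("Fuc", []),
    ("GlcNAc", [("Man", [
        ("Man", [("GlcNAc", [("Gal", [])])]),
        ("Man", []),
    ])]),
])

internal_hit = search(lacnac, glycan)               # found somewhere inside the glycan
root_hit = search(core_fuc, glycan, at_root=True)   # found at the reducing end
exact_hit = search(core_fuc, glycan, exact=True)    # the glycan is larger than the pattern
```

With neither option set, the pattern may sit anywhere in the tree; "at root" pins it to the reducing end; "exact" additionally forbids any residues beyond those in the pattern.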
However, our long-term goal is that we can create federated queries between GlySTreeM, the ontology for the glycan structures, GlyConnect, UniProt, and forward to UniLectin, for example. That's why we're working on an ontology for UniLectin. So I could potentially pull out lectins that recognize a certain epitope, and the diseases that that epitope is found in. That would be the long-term goal. As of yet, we don't have the lectin ontology totally finished, but we're able to do a certain part of that pathway. We recently uploaded to bioRxiv a paper that outlines this a little bit more, and it's called "This is GlycoQL". I just want to show a little bit of that.
As Frédérique mentioned, we have a COVID data set within GlyConnect. It's a subset of the data that's related just to COVID. What I was able to do was generate a federated query that pulled out the recognized sites from UniProt for the spike protein, linked that with the data in GlyConnect, and pulled out the glycans for each of those sites. The first thing we were able to do was to generalize: are these structures high-mannose structures, as in here? Are they some kind of hybrid structure? Are they a complex structure? Are they sialylated? I also made a group that had just one GlcNAc, where it was hard to say whether it was a complex or a hybrid. And initially you can see some kind of profile there. So then I created a profile across all the sites that were recognized. And while this was very useful, it was way more useful when we were actually able to look at the sites individually. Sorry, I don't have the figures on their own, so I'm showing you the paper. These are the sites for the different actual proteins that we held that represented the spike protein.
The difference between the four proteins was the cell line that they were expressed in. All of the spike proteins that we have are actually expressed proteins rather than native. The CHO and the HEK cells show some similarity here. These two were just the RBD domain, which is why they're lacking information about the other sites, whereas this one here actually contains the receptor binding domain. But the big thing you see is that this insect cell line really bears no resemblance to the expression in the other three cell lines. So we were able to pull out all of that information with one query. It took some manipulation to create the graphs, but actually gathering the data was made a lot simpler by the use of the GlySTreeM ontology, linking the data, and federated queries. So that is mostly what I wanted to show you. And if there are any questions, or if anybody would like to see a little bit more of the substructure search, maybe.