Hello, my name is Catherine Hayes and I work in the Proteome Informatics Group based at the University of Geneva. Today I'm going to talk about glycan-protein interaction motifs.
First, a few words about glycobiology and glycosylation. A glycoprotein is a protein that has a carbohydrate linked to it, also known as a glycan or a sugar residue. This sugar residue usually has some kind of binding motif that is recognized by a glycan-binding protein.
In our group we have a couple of resources. One of them is GlyConnect, which is mainly concerned with glycoproteins. It's a curated database: we take publications that solve the structures of glycans and glycoproteins, and we also store the metadata around these, so things like the taxonomy, the disease that was associated with the glycoprotein, the tissue that it was found in and so on. Our other resource is UniLectin, which is again a curated database, this time of glycan-binding proteins. Lectins are a particular type of glycan-binding protein that have a non-catalytic binding domain and bind glycans reversibly. They're involved in recognition events, and what they recognize is the glycosylation pattern found at the top of a glycan.
So we can imagine a biological question that asks: how could I access information on both sides of a binding motif, that is, information about both the glycoprotein and the lectin? This talk is aimed at somebody who might be asking these kinds of questions, so a biologist or a glycobiologist. What we need to do is allow substructure searching of glycans that are associated with glycoproteins and the lectins that recognize them. In other words, how do we create a link between our two resources, GlyConnect and UniLectin?
Now a little bit about the structures themselves. A carbohydrate structure, an oligosaccharide, is shown on the right hand side in the usual kind of chemical structure drawing.
Underneath, you see what's called a SMILES string, which is the text format behind this representation. On the right hand side, we have what's more usual to glycobiologists. This cartoon format is the Symbol Nomenclature for Glycans (SNFG), which is a standard format for displaying glycans. All of the chemical information in the chemical drawing is encoded in the shapes and colours of the cartoon. This cartoon representation also has a string format behind it. There are many different kinds of string format, but the one that we use in GlyConnect is GlycoCT, which is a connection table: a list of residues followed by a list of linkages.
Now the structure that's displayed here is an ideal one, in that all of the topology and the sequence information has been solved. You can see all of the linkages are shown there. However, a more realistic view of solved glycan structures is one like this, where we have five structures that have the same composition but different levels of sequence and topology information. This is due to how these structures were solved. If, for example, NMR is used, we have a very high degree of resolution and most of the information about a structure. The level of detail then drops through mass spectrometry and HPLC down to monosaccharide composition analysis, which really only gives you the composition.
Why is this important? If we're looking for a substructure, for example the lactosamine unit shown in the top left-hand corner, which is represented here by a blue square and a yellow circle, visually it's very easy to see where it is in all of these cartoons. But if we look at the string format behind them, where the lactosamine unit is shown in bold, you can see there are five different ways of actually encoding that lactosamine unit. And as a computer needs some kind of string format to parse, this becomes quite a complex problem.
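To make the parsing problem concrete, here is a minimal Python sketch. It assumes the condensed GlycoCT layout described above (a RES list of residues followed by a LIN list of linkages). The two example strings, the regular expressions, and the helper names `parse_glycoct` and `has_lacnac` are illustrative assumptions, not code from GlyConnect. The sketch shows that the linkage lines of an isolated lactosamine unit do not appear verbatim inside a larger glycan, because the residue numbering shifts, while a match on the parsed residues and linkages still finds the unit. Undefined sections, repeats and alternative anomers are deliberately ignored.

```python
import re

# Lactosamine on its own: Gal(b1-4)GlcNAc.
LACNAC = """RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dgal-HEX-1:5
LIN
1:1d(2+1)2n
2:1o(4+1)3d"""

# The same unit attached to a mannose, so every residue index shifts by one.
EXTENDED = """RES
1b:a-dman-HEX-1:5
2b:b-dglc-HEX-1:5
3s:n-acetyl
4b:b-dgal-HEX-1:5
LIN
1:1o(2+1)2d
2:2d(2+1)3n
3:2o(4+1)4d"""

RES_RE = re.compile(r"^(\d+)([a-z]):(.+)$")
LIN_RE = re.compile(r"^\d+:(\d+)[a-z]\((-?\d+)\+(-?\d+)\)(\d+)[a-z]$")


def parse_glycoct(text):
    """Return ({residue_id: descriptor}, [(parent, parent_pos, child_pos, child)])."""
    residues, linkages = {}, []
    section = None
    for line in text.strip().splitlines():
        line = line.strip()
        if line in ("RES", "LIN"):
            section = line
            continue
        pattern = RES_RE if section == "RES" else LIN_RE
        match = pattern.match(line)
        if match is None:
            raise ValueError(f"unsupported GlycoCT line: {line!r}")
        if section == "RES":
            residues[int(match.group(1))] = match.group(3)
        else:
            linkages.append(tuple(int(g) for g in match.groups()))
    return residues, linkages


def has_lacnac(text):
    """True if a b-Gal is linked (4+1) to a b-Glc that carries N-acetyl at position 2."""
    residues, linkages = parse_glycoct(text)
    for parent, parent_pos, child_pos, child in linkages:
        if (residues[parent] == "b-dglc-HEX-1:5"
                and residues[child] == "b-dgal-HEX-1:5"
                and (parent_pos, child_pos) == (4, 1)):
            if any(p == parent and pos == 2 and residues[c] == "n-acetyl"
                   for p, pos, _, c in linkages):
                return True
    return False


# Naive text search: are the LIN lines of the isolated unit inside the larger string?
print(LACNAC.split("LIN\n")[1] in EXTENDED)
# Structure-aware search on the parsed connection table.
print(has_lacnac(LACNAC), has_lacnac(EXTENDED))
```

The structure-aware check works on residue identities and linkage positions rather than on line text, which is the same idea that motivates moving the glycans into a graph-like representation.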
So what we needed was a glycan representation that allowed easy substructure searching. The solution we found was RDF, the Resource Description Framework. It's a way of representing data on the internet, and it involves the use of triples, which are basically three pieces of information: a subject, a predicate, and an object. To convert our structures into this kind of framework, we needed an ontology to describe what our glycans are like. So we developed the GlySTreeM ontology, and we also developed an algorithm to translate from the GlycoCT string format into this triple format. We store the triples using software called GraphDB, and this allows us to serve a SPARQL endpoint. What this means is that we can query our glycan structures using a language called SPARQL. We're also able to do what are known as federated queries, which allow us to link to other databases to assign metadata to all of these structures and the substructures that are found in them. I'll come back to that a little later.
If we go back to our GlycoCT format, which is shown here in the top left-hand corner, the SNFG representation of that GlycoCT string is shown at the bottom, and our ontology model is shown on the right. If we look at the first residue in this glycan structure, it's represented by the first two lines in the residue list, by the blue square in the cartoon, and by this residue root object in our ontology. The entire linked part of the glycan is represented by the residue list and the linkage list, by the linked sequence in the cartoon, and by all of these different objects in our ontology. But by far the most powerful part of this ontology is the region that describes what are known as undefined sections or fragments. Sometimes in experimental structure determination, we find a monosaccharide that we know is on the structure, but we don't know where on the structure.
So in our GlycoCT format, this is depicted by an undefined region, as you see circled at the top. In our cartoon, it's shown as a single monosaccharide after a bracket. In both of these representations, the undefined section is actually separated from the rest of the topology. Our GlySTreeM model, on the other hand, allows it to be represented on its own but also linked back to the main structure, and this allows us to search it as well.
The uses of this GlySTreeM ontology and triple store are, first of all, in-house. We were able to use it to validate the structures that we have in GlyConnect, to make sure that they make sense, that the compositions match, and that we have the right core types assigned to them. The next set of users would be experimentalists, and I'll talk about that in a second. We also envisaged programmatic access, which is served by the SPARQL endpoint and allows programs to access the information and query the database.
So, experimentalists. If a general scientist wanted to query this database, they would need to write queries in the SPARQL language. This is not a trivial matter, and doing it manually has the potential to introduce errors. So we needed a method to automatically generate these substructure queries. We took the algorithm that we had used in the first place to generate GlySTreeM, we adjusted it to account for missing information, and the output is now a SPARQL query. So we go from our GlycoCT string directly to a query.
How can we apply this? Take an example such as COVID. The S protein, the spike protein of the virus, is a well-known glycoprotein. It's covered with glycosylation sites which carry different types of glycans, and these glycans actually cover 40% of the protein surface.
And that means that they're really important in the immune response to such a virus. The receptor-binding domain, which is found on the spike protein, also carries glycosylation sites. So how could we look at this? Our first port of call was to curate a number of publications on the spike protein and the glycans that are found on it. We now have a little data set on our main site called the COVID data set. There are a number of references there, and there are over 20 amino acid glycosylation sites. On these sites, over 200 glycans have been found.
To look at this in a meaningful way, we decided to create patterns. By looking at the over 200 structures, we generated five patterns that describe clusters of these structures, if you like. These are roughly the same as the core types that you would normally find for N-linked structures: oligomannose type, hybrid type and complex type. Take our oligomannose type structure, for example, which is shown as a cartoon in the top left-hand corner. This is the GlycoCT that describes it. We used GlycoQL to translate the GlycoCT into SPARQL, and the query is shown on the right-hand side. You can see it's not a trivial thing to write a query like this by hand.
What this allowed us to do was to take the SPARQL query for each of the patterns, query the over 200 structures that we had for COVID, and generate a clustering of all of the structures. Here you can see that most of the structures found in our data set are complex or sialylated complex structures. This particular graph is for the entire COVID protein. What was more interesting was looking at each of the sites individually. Again, you can see that most of the sites depicted on the left-hand side are complex, whereas the ones to the right are more oligomannose and hybrid type structures.
We went back to the original papers and we found other things that differentiate between these lists of structures. The two on the left-hand side, A and B, were studies done on the entire spike protein, whereas C and D were done on just the receptor-binding domain (RBD), which is the domain that interacts with the human ACE2 protein. Immediately you can see that, even though there are only two glycosylation sites on the RBD, the two profiles are very similar. They show complex and sialylated complex structures, depicted in yellow and purple, which is very similar to the top left-hand profile. The bottom left-hand profile was really interesting because the recombinant protein was expressed in an insect cell line. While this is often used and can be very useful, we see here that it doesn't really represent the glycosylation that's found when the protein is expressed in HEK or CHO cells.
The next step was to take this method and apply it to a well-known list of epitopes which was published in 2009. We took these epitopes and, using our algorithm, generated SPARQL queries. We used these queries to query GlySTreeM and generated a mapping between each of our epitopes and our structures. We have saved this mapping and we use it in our conceptual map tool, Octopus. So you can go there and search directly for any of these substructures.
If you have a substructure that perhaps isn't found in Octopus, we needed users to be able to search for it themselves. So we created this simple user interface where you can enter a GlycoCT string. There are a couple of flags you can choose. Then, when you press the search button, the display shows the cartoon of the structure you searched for as well as a list of matching structures that are found in GlyConnect.
And coming soon, we'll have a drawing tool so that you can actually draw the structure in.
Just to mention federated queries: these are queries where you can also query other databases, and this is possible for more complex queries. However, the SPARQL query needs to be saved as a file, and we use this curl command. As an example, if I wanted to list glycans associated with immunoglobulin and this particular disease, myositis, this is what the query would look like. It queries both the GlySTreeM glycans and the GlyConnect metadata. This is the curl command that you would send from your terminal, and you get a result like this back in JSON format, which gives you lists of the glycans and types of glycans that are found.
The next step in all of this is to repeat this for other epitope sets, including for example UniLectin3D, which has over 200 ligands associated with it. Another strong reason for doing it like this is that in glycoinformatics there are a number of different ontologies in use. We have GlycoRDF, which describes experimental conditions, the GlycoCoO ontology, which is what GlyConnect is based on, and the SugarBind ontology. Using all of these ontologies, we can actually speak between them, and this allows links between all of these various data sets and databases, which makes it very powerful. We're also going to include repetitions, to bring in structures such as GAGs, the glycosaminoglycans, and then also link to reaction databases via ChEBI identifiers.
More information about both GlyConnect and GlySTreeM can be found at these wiki addresses. Thank you.
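As a rough illustration of the federated query step described in the talk, here is a minimal Python sketch of what the curl call does, following the standard SPARQL 1.1 Protocol (a POST with the query as the request body and a JSON results format requested in the `Accept` header). Everything specific in it is an assumption for illustration: the endpoint URLs, the `ex:` vocabulary, the query text, and the sample response are all made up and are not the actual GlyConnect or GlySTreeM endpoint, ontology terms, or results. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the real SPARQL endpoint address.
ENDPOINT = "https://example.org/sparql"

# Hypothetical federated query: the SERVICE clause delegates part of the
# pattern to a second (also hypothetical) endpoint holding the metadata.
QUERY = """
PREFIX ex: <https://example.org/vocab#>
SELECT ?glycan ?glycanType WHERE {
  ?glycan ex:hasType ?glycanType .
  SERVICE <https://example.org/metadata/sparql> {
    ?glycan ex:attachedTo ?protein ;
            ex:associatedDisease "myositis" .
    ?protein ex:name "immunoglobulin" .
  }
}
"""


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol POST request that asks for JSON results."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of {variable: value}."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


# Made-up response in the standard SPARQL JSON results layout, used here
# only to show the parsing step; it is not data from any real database.
SAMPLE = json.dumps({
    "head": {"vars": ["glycan", "glycanType"]},
    "results": {"bindings": [
        {"glycan": {"type": "uri", "value": "https://example.org/glycan/1"},
         "glycanType": {"type": "literal", "value": "N-linked"}},
    ]},
})

request = build_request(ENDPOINT, QUERY)
# To send it for real: payload = urllib.request.urlopen(request).read()
rows = rows_from_results(SAMPLE)
print(request.get_method(), request.full_url)
print(rows)
```

Keeping the request construction separate from the result parsing means the same two helpers could be reused for any query saved to a file, which mirrors the file-plus-curl workflow mentioned above.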