Welcome, all, to this SIB virtual seminar. Today we have the pleasure of hosting Frédérique Lisacek from the Proteome Informatics Group at SIB in Geneva. Frédérique studied mathematics at the Université Pierre et Marie Curie in Paris. Then in 1984 she received a PhD in computer science (artificial intelligence) from the same university. In the following years she held research positions in biology labs in several countries, among them France, Japan and Australia, working on knowledge representation and predictive methods. In 1999 she was hired by Proteome Systems Ltd in Sydney, and in 2001 by GeneBio SA here in Switzerland. In those two companies she led projects on knowledge management and text mining. In 2006 she joined the SIB Swiss Institute of Bioinformatics in the Proteome Informatics Group (PIG), which she has managed since 2008 and where she drives knowledge discovery projects. The group is involved in developing databases and software for the proteomics and glycomics communities. These resources are made available through the ExPASy portal. The software tools support the analysis of experimental mass spectrometry data, mainly for the detection of post-translational modifications, and the databases store knowledge of carbohydrates attached to proteins as well as protein-carbohydrate interactions. So today Frédérique will share with us a sweet picture of bioinformatics. Frédérique, thank you again for accepting our invitation, and the floor is yours.

Do you see my group? As Diana said, the group used to be Ron's group and has developed a lot of expertise in analyzing mass spectrometry data and identifying proteins. Mass spectrometry is probably one of the rare experimental techniques that can be used to identify post-translational modifications, so in recent years we have focused on this particular area, and glycosylation is of course one of these modifications and deserves special attention.
And this is what I'm going to focus on today. So part of my group will not be mentioned that much, but I will introduce my main sweeteners: Julien Mariethoz, Oliver Horlacher, Marie Ghraichy, Davide Alocci, Marcin Domagalski, and Julia Bianchi.

So I'm going to have to tune you in to the glyco channel. There are a few things that I should probably insist on before we go into the harder things. The first thing is vocabulary. A sugar is a glycan, is a carbohydrate, is a polysaccharide. All these words are used by different communities, depending on whether you come from chemistry or from biology or glycobiology. So please, if I use any of these words, I always mean the same thing: a structure that can look like this, with a lot of those rings. The depiction that you usually find in the literature is those little cartoons, with a meaning behind each symbol. So galactose is the yellow circle, the red triangle is fucose, et cetera. There is a one-to-one correspondence between a monosaccharide and a symbol. And the specificity of these structures is that they have different linkages between the monomers, which are very specific and of interest.

The second thing I want to say is about glycobiology and glycoproteins. I'm picking just one glycoprotein out of many. This is how they are usually depicted in Wikipedia or in the PDB, wherever. And in fact, this is the way we see the proteins: really decorated with sugars. You can imagine that from a functional point of view it makes a hell of a difference. The sugars shown on this protein are just an example set out of all those that have been sequenced. So you can see it's a set of relatively complex structures, and you can see how they actually branch on the surface of the protein. So it is important to bear that in mind.
And my last little detail, for understanding what we're doing further down the line in this presentation: this is what is called core fucosylation. That concerns the first GlcNAc residue of an N-linked sugar. I suppose some of you know that there are different types of linkages to the protein. There is one type which is N-linked and another which is O-linked. N is on asparagine. And sometimes the first residue is fucosylated. I will explain later on why this is of importance.

So now we are tuned in to the channel, and I'm going to keep going with my explanations of what we're doing. I will just give you the two starting points; what we are doing now stems from two types of initiatives. The first one: glycoinformatics is actually a field that is a bit remote and not very well known in bioinformatics. You see, with these pink shapes, the main themes that people have focused on: the 2D structure and the 3D structure of sugars. So you have a number of databases. So this is the pointer and I press here. Yes, 2D structure. You have GlycomeDB, which is a 2D structure repository. GlyTouCan, which was just published this year, is another one. GlycoBase. UniCarbKB is the main initiative I will focus on, because we are part of that initiative. There are also all the funny sugars found in bacteria and in fungi that are stored in this database. The oldest website is Glycosciences.de, which was set up by people in Heidelberg probably 15 years ago and has a lot of chemistry on it and so on. So there are a lot of 2D structure databases. For 3D, a little less: there is a bit of software and some repositories. So this is Glyco3D from Grenoble. So .de of course is in Germany. GlycomeDB is in the US. GlyTouCan is in Japan and this one is in Russia. This GLYCAM set of tools is in the US. And this one is in Grenoble.
And so this focuses a little bit on binding, that is, on the proteins called lectins that recognize the sugars and bind to them. I will come back to that as well. Then you have some databases of proteins, mainly enzymes, like the CAZy database, which categorizes and classifies glycosyltransferases and glycosidases, all the enzymes that either synthesize or destroy sugars. So you can see that there is a variety of databases on a variety of themes. I will also come back to SugarBind, because this is one of our own. But this is really extremely unbalanced and disconnected, and this is the landscape in glycoinformatics.

Now, in bioinformatics, one lesson that we've learned, especially from the Swiss-Prot and UniProt experience, is that context information is really important. If you take a molecular entity like a protein, you have all sorts of properties that are worth considering if you want to understand the function. The other lesson is that sequences are ever-present in the whole environment of bioinformatics, whether in databases or in tools. I depict the situation in this way: you have a mesh of resources, with databases that are possibly linked to experimental data, that have, most of the time if they're still around, a user-friendly interface, that sometimes have prediction tools attached, and most of the time those prediction tools have a reliable scoring so that you can actually evaluate the quality of the prediction. My world is the world of proteins; I'm sorry for those who are focused on genes, but I am focused on proteins. So you have protein family databases, 3D structure, interactions, enzymes, etc.
And of course there are links to the genomes, by definition. This defines the bioinformatics landscape, and what is great about Swiss-Prot is that it's just sitting in the middle and connecting to all these resources. This is an example we would like to follow.

So indeed the idea is to say that we have our glycan, and our glycan has all sorts of properties. The monomers can be modified. Glycans have a 3D structure. They are associated with diseases. They have a number of enzymes, the so-called CAZymes, for carbohydrate-active enzymes, that actually make them or destroy them. They interact, so there is some binding. They are expressed in certain quantities, etc. So to understand the function of a glycan, you need properties in the same way.

Unfortunately the glycoinformatics landscape is not as rich as the bioinformatics landscape, as I said before. We have a lot of missing links. The interfaces are not always user-friendly, to say the least. Some connections between databases that should exist do not exist. So we are very much lagging behind in terms of wealth of resources: they are there but not well connected, some are missing, and we lack a lot of tools and a lot of scoring schemes for the predictions that could be made. So all in all, we need to do something about it.

Nonetheless, as an initiative, we published a viewpoint in 2011 launching the UniCarbKB solution to this problem, with the same idea as Swiss-Prot: we should have the glycan sitting in the middle, with information that is connected to all the resources. So we have to increase the connectivity and we have to create new resources, and this is what we have started to do. How to get there is a long way, and the first thing that we need, by definition, is standards. So there is a very recent initiative, of which we are now part, which is called the Minimum Information Required for A Glycomics Experiment (MIRAGE), probably 10 years behind all the
others, but still, it has started, and we have the support of the Beilstein-Institut in Germany to actually pursue this goal. The first standard, for mass spectrometry experiments, was published, and there is an initiative for glycan array data, and so on for all the different sorts of data that are produced in glycomics. This is really ongoing and very helpful to start structuring and gathering data in glycomics.

The second set of standards is glycoinformatics-related. Last year we published recommendations for using the standards that exist, showing that there is, unfortunately, a range of formats, a range of options to describe the structures. We focused on GlycoCT, which is a sort of graph encoding of the sugars. You have here the part that lists the residues, so the monomers, and in another part you list the linkages. There are really good parsers and there is a toolbox to actually use that format. So we have decided, across all the databases that I showed earlier, that GlycoCT would be at least the format shared amongst the different databases and, as I said, recommended in that publication.

The second effort we're making is going to the Semantic Web and using this new initiative to integrate data in a better way. This is an example of a very simple schema that was developed to describe the structure and have everything regarding the structure of the sugars stored in triples. This is ongoing work and it actually has an application that I will show you afterwards. So this is just an example of how you encode a small part of the structure, where you have a glucose and a galactose linked by an alpha-1-3 linkage, and these are the predicates that describe this linkage and this structure in general.

All this effort has to be made within consortia. We are a very small community; glycoinformatics is a very, very small community. There is so much to do that it would be totally
absurd to do things without connecting to one another. So we are really extremely active in trying to gather people in small workshops where we remodel the world of glycoinformatics and try to really do things together, so that we have a better chance of coming out of the isolation and total oblivion we are in at the moment, and try to connect with the other initiatives in bioinformatics and in proteomics. Proteomics is certainly our first target, and we really see ourselves as being in the field of glycoproteomics.

You can see that there are really a lot of different participants, and the three databases with which most of these people are associated are: UniCarbKB, the initiative I showed earlier, which is really trying to explore the connectivity of resources; UniCarb-DB, which is the experimental database, with everything you could want to know about the mass spectrometry behind solving a structure; and the SugarBind database that we develop in my group, which is really looking at the recognition of sugars at the surface of proteins. I will go into that later, which is why I'm not insisting on it much now. The only thing I wanted to say about these three databases is that we have decided to develop them with the same look and feel. Technically, it's all the Play framework, with Java behind, and we exchange all the necessary pieces of code from one to the other. We have developers talking to one another as much as possible, and this is really going in the right direction. And very recently we formed a consortium of all the developers, so that we would have a dump of our code in the...
I think it's a GitHub, but I'm not quite sure, at the moment I have a blank. But anyway, the idea is that with the Japanese, with the Australians, and with some American groups as well, we are going to share code to do further development. Maybe it's the influence of Geneva, but that's the way I conceive of science, as far as I'm concerned.

So now I would like to start focusing on making sense of that data. One of the things we are looking at is the experimental side, where of course we are using our know-how in analyzing mass spectrometry data. This allows me to mention the people in the group who are not sweeteners, who over the years have been developing the MzJava library, which is a Java library of code for analyzing mass spectrometry data. We have a focus in that library on annotating spectra and so on, on the identification side of things. As you may not know, there is no equivalent in glycomics of software like Mascot or any other search engine for proteomics. It is not possible, because there is no database that is rich enough or complete enough, and there is no template for sugars, so we can't have a theoretical spectrum that really makes sense for identifying sugars from mass data.

Nonetheless we are trying. Oliver has come up with ideas for clustering spectra and trying to infer structures using his experience in looking for PTMs. When you look for a post-translational modification in a mass spectrum, the idea is simply to look for a small mass shift, because this is what a modification produces. So we are using this same type of software, looking for mass shifts in the data, to see whether we can infer a structure monomer by monomer. Generating the composition is not so difficult, but generating exactly the right monomer is a challenge. So he is using this technique to infer the structure, and instead of having millions of possibilities, the idea is to suggest a limited number of candidates
with a scoring system. This work is soon going to be published; we are trying. Oliver, if you are listening: the paper! So this is the part regarding the experimental data.

Now, we also have a large effort towards connecting the databases, and at the moment we are focusing on host-pathogen interactions mediated by sugars. If you imagine the cell membrane of a host cell here, and you imagine a bacterium, or it can actually be a virus (my example is with a virus), there are a lot of surface proteins that recognize the sugars of the host, and this is actually the first step of the pathogen starting the infection of the host. This is of course a business of binding, which you can model in 2D, or in 3D in the best of cases, and this problem is already complicated enough. So the SugarBind database contains information on the part... let me go back, whoops, here. SugarBind focuses only on that part: it takes the bacterial or other pathogen lectin, so the protein here, and the part of the sugar that is actually recognized by it.

This is how the information is stored in SugarBind. Here we have, for instance, asked to see the information regarding this norovirus. This norovirus has a surface protein called VP1, and VP1 recognizes a number of sequences, sugar sequences of course. Sorry, we also use the term sequence, although you can see it's a two-dimensional thing, but there is a linear notation with brackets, which is yet another notation, one of the many formats. If you hover over the name you can see the structure in 2D, and then you have some associations here, whether it's the disease, the affected tissue or the lectin, so you can see whether there is some related information within the database. I'm just going to give you an example of how this can be exploited to understand these interactions. So here is our VP1. So let me focus on VP1. VP1 in the
database is actually cross-referenced to a UniProt entry, which is called VP1 and characterizes one of the surface proteins of this norovirus. This is also described in ViralZone, where you can find further information and where we also have cross-referencing, as you can see here. This was an achievement of a seminar I gave at Swiss-Prot just a year ago, I think on exactly the same day, where I suggested we should have a connection between ViralZone and SugarBind. So it's done, and we have this information on the lectin.

Now let's go back to where we have information on the disease and on this particular substructure, which is recognized by VP1 on the host. We can focus on this one, and there is a dedicated page in the database that gives all the information. Everything is curated, so we have a PubMed link to the paper that describes the binding, we have some information on the side that qualifies this particular substructure, and we have a substructure search that allows us to connect to the full structures in UniCarbKB, UniCarbKB being the database with all the full 2D structures and the proteins. One of the 34 structures that were found is this one: you can see this is the tip and this is the full structure. You can also see that UniCarbKB has the same sort of look and feel, with the PubMed reference; it is also a curated database. The uniqueness of UniCarbKB is that it lists the proteins on which this sugar is sitting, which of course has to be found in the literature, and this is very tedious work. If I focus here, you can see that there are two proteins that carry the structure that I have just shown, and of these 2 proteins the relevant one is the nuising 4 of Homo sapiens. So in the end, what we can say is that there could be an interaction between the VP1 lectin, which I reached from one direction, and this glycoprotein, reached from the other, that is carrying the sugar of which the original substructure was
part. So now we have the means of suggesting a possible interaction between two proteins via a sugar, which was not possible until recently. But it's just a lead to start thinking: I'm certainly not suggesting that this interaction exists, only that it is now possible to address the question.

So I'm sort of announcing today a new tool collection, because we have been working on a number of tools that are helpful for investigating, especially, what are called the glycoepitopes. These substructures that are recognized are well known and have been extensively studied in the past. Maybe some of you didn't know, but your blood groups are actually specified by sugars: blood group A is this one, blood group B this one, etc. So it's like a small alphabet of sugars of interest, and we are focusing on these sugars of interest.

We now have a first tool, developed by Marie Ghraichy in my group. She's a master's student and she's having fun trying to plot the information on all the known epitopes that we have gathered from different sources: there are sources in the literature, SugarBind is a source, another database called GlycoEpitope is a source, and so on. The idea is to be able to see how these epitopes are connected to one another from different perspectives. For instance, they are connected to one another because they differ by one monomer. Then we put some weights here: the size of a node is representative of the size of the sugar, and the thickness of an edge is related to how many sources have identified this ligand. But this is all exploratory, so we are trying all sorts of things. The idea is to see what we can see: can we observe any sort of regularity, what are the clusters depending on the properties? So we now have a tool to address these questions.

The second tool, which I already mentioned because it's integrated in SugarBind, is
the substructure search. It was written by Davide Alocci, and this is really exploiting RDF at its best. We now have two interfaces. One takes GlycoCT as a query, but we gathered that not many people would be so familiar with it that they can write GlycoCT code just like that. So the idea is to also offer the possibility of drawing a sugar structure, and this is GlycanBuilder here, which we have integrated. It's a tool that was released a few years ago already, out of a European project called EuroCarbDB. They produced a few tools and this is one of them, so we thought that we would use it. If you query with this structure, drawn in GlycanBuilder as I showed, these are the results. We have at the moment 17,000 structures, mostly coming from our databases and GlycomeDB, and you can see that this substructure is found in many different structures. So this is a useful tool that glycobiologists, and hopefully not only glycobiologists, can now use.

Regarding glycoproteins, I insist again on my sugars sitting on the protein. This is not released yet, but it probably will be a few months from now, hopefully in early 2016. My other sweetener, who is not officially in the group but is working with me, is Alessandra Gastaldello. She's doing an internship, and we have decided to focus on the alignment of glycosites depending on the structure. If you remember what I said at the beginning, you have the core-fucosylated and the non-fucosylated core. This is a very important property of N-linked sugars, and we thought it might be interesting to align the sites depending on the properties of the sugars. There is prediction software to infer the glycosite. There is a very short sequence, which is asparagine, then anything, then serine or threonine. This is a very small pattern which is called the sequon, and it has been known for many, many years, but you
can imagine that this is everywhere in proteins, or at least not everywhere, but very common, and not all of these sites are glycosylated. So there have been efforts to gather the well-known glycosylation sites and to build methods like NetNGlyc, so a neural network, or even HMM predictors. You give examples of sequences that are glycosylated and some that are not (which you can never be sure of), and then you predict from the sequence environment. And in fact it does not work very well. So we thought there might be a way of refining that type of prediction, because there is a high chance that the structure actually constrains the sequence.

And we do see differences. You can see that some columns here are different in the alignment. It is only an excerpt, of course; there is a long list, but it would have been a bit too bulky here, so I didn't show it. The other thing that we did, because in some cases we have a very limited number of sequences, was to go into Swiss-Prot to see whether we can find related proteins from different species that have the same pattern, and whether we can align them. So we are playing with this at the moment, trying to see the effect of the distribution, the effect of size, the effect of all sorts of possibilities on the quality of the alignment, with the prospect of potentially refining the prediction of glycosylation sites. Here, equally, this is just the presence or absence of fucose on an O-linked site, where there is just one amino acid, the modified amino acid, which is a serine. You can see here that we have two different-looking patterns. We are certainly not asserting that this reflects reality, but we are exploring, hoping to find more regularities than have been found up to now.

So these are the three tools. On the same note, I'm happy Lou is in the audience, because we are thinking of giving GlycoDigest a web interface as well.
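The sequon described above (asparagine, then any residue, then serine or threonine) is simple enough to scan for directly. The following is a minimal Python sketch and not one of the group's tools: the function name `find_sequons` and the demo sequence are made up for illustration, and excluding proline at the middle position is a common refinement that the talk does not mention, so it is labeled as an assumption in the code.

```python
import re

# N-glycosylation sequon: Asn, any residue, then Ser or Thr.
# ASSUMPTION: proline is excluded at the middle position (a common
# refinement); the talk itself only says "anything".
# The lookahead lets overlapping sequons (e.g. "NNTS") all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence):
    """Return (1-based position of the Asn, tripeptide) for each sequon."""
    return [(m.start() + 1, m.group(1))
            for m in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    demo = "MKNVSAANPTGGNNTS"  # hypothetical sequence, not a real protein
    for position, motif in find_sequons(demo):
        print(position, motif)
```

As the talk points out, a hit only marks a candidate site: the pattern is very common and many matching positions are never glycosylated, which is why predictors try to learn from the surrounding sequence.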
GlycoDigest is a tool that Lou developed that theoretically digests the sugars. We know the effect of some glycosidases, so we can take a full structure, simulate the effect of a glycosidase, and chop off the monomers that are affected by that glycosidase. This is also very useful for glycobiologists. So we really are gearing towards building a toolbox that is useful across the different resources together, as I mentioned at the beginning.

Our prospects are to increase connectivity and to have more scoring schemes, because a lot of the tools available in glycoinformatics are very poor in reliable scoring schemes. We are also very much into developing visualization and exploration tools, because at the moment reaching strong conclusions is difficult, but getting your mind to think of new ideas is certainly helped by this sort of tool. In a nutshell, our strategy is to rely on bioinformatics solutions and to get inspiration from what has worked before. We want to enforce as many standards as possible, or reference formats, because "standard" is maybe a bit pretentious, by trying to get the people who are collaborating in this area to choose similar formats and to join forces by all means, because we are, as I said, very few, and we need a critical mass to show the proteomics people, and maybe one day the genomics people, that sugars are important and that they play a role in biology.

Regarding acknowledgments, as you could see, I have a whole crowd of people to acknowledge. There are a few names here, of the closest. Mainly Nicki Packer, without whom this project would not have come to life. She's in Sydney, Australia; as Diana said, I spent some time there, and that can be heard in my accent as well. Kiyoko in Tokyo is also very active. She's the real loudspeaker for RDF, trying to convince everyone in glycomics that we should go for this model, and we are talking really very often with her in that
respect. Matthew is working with Nicki and he's the main developer of UniCarbKB. Niclas Karlsson started UniCarb-DB; he's a mass spectrometrist, working with Miguel now. Daniel is also a mass spectrometrist. Pauline is one of the great figures of glycobiology. Serge Perez, in the chemistry of carbohydrates: I don't think anyone has as much experience as he has. René Ranzinger, who also developed GlycomeDB, etc., etc. So really lots of good people.

And my funding: I am one of the privileged persons to have a bit of money from SIB to provide these services, so SugarBind and the other tools are available on ExPASy. We had a Swiss National Science Foundation grant, which is over, and I have asked for another one; we'll see. And we are part of a Marie Curie ITN and have money from them as well. Thank you very much.
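As a footnote to the mass spectrometry part of the talk: the idea of reading a structure "monomer by monomer" from mass shifts can be illustrated with a toy Python sketch. This is not the method developed in the group, only a simplified illustration under stated assumptions: peaks are treated as a singly charged ladder of fragments, the function name `label_mass_shifts` and the tolerance value are hypothetical, and the example peak list is made up. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of common monosaccharide classes.
RESIDUE_MASS = {
    "Hex": 162.0528,     # hexose: galactose, glucose and mannose share this mass
    "HexNAc": 203.0794,  # N-acetylhexosamine: GlcNAc and GalNAc
    "dHex": 146.0579,    # deoxyhexose, e.g. fucose
    "NeuAc": 291.0954,   # N-acetylneuraminic acid
}


def label_mass_shifts(peaks, tolerance=0.02):
    """Label each gap between consecutive peaks with a residue class.

    ASSUMPTIONS: singly charged fragments forming a simple ladder, and
    an illustrative absolute tolerance in Da. Gaps matching no residue
    class are labeled None.
    """
    ordered = sorted(peaks)
    labels = []
    for low, high in zip(ordered, ordered[1:]):
        shift = high - low
        match = None
        for name, mass in RESIDUE_MASS.items():
            if abs(shift - mass) <= tolerance:
                match = name
                break
        labels.append((round(shift, 4), match))
    return labels


if __name__ == "__main__":
    # Hypothetical fragment ladder: +HexNAc, +Hex, +dHex.
    ladder = [500.0, 703.0794, 865.1322, 1011.1901]
    for shift, residue in label_mass_shifts(ladder):
        print(shift, residue)
```

The sketch also shows why, as said in the talk, composition is the easy part: a shift of 162.0528 Da only says "a hexose", and cannot by itself distinguish galactose from glucose or mannose, nor say which linkage is involved.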