Everybody, welcome to the Virtual SIB Computational Biology Seminar. Today we have the pleasure of having Philippe Le Mercier, who kindly agreed to give a talk on one of the SIB resources. I will briefly go through his bio. He did a master's and a PhD in biology at the Institut Pasteur in Paris, where he worked on the reverse genetics of rabies virus. Then he moved on to a postdoc at the Department of Microbiology and Molecular Medicine at the University of Geneva, where he worked on the reverse genetics of another virus, Sendai virus. In 2004 he moved again and became a curator at the Swiss Institute of Bioinformatics in Geneva, where he worked for four years. Since 2008 he has been the head of the virus program of Swiss-Prot and UniProtKB, at the same place, the Swiss Institute of Bioinformatics in Geneva. So I'm very happy you're here, and the floor is yours.
Thank you very much for inviting me. I'm going to present to you the web resource ViralZone from the Swiss Institute of Bioinformatics. The main aim of this web resource is to bring together textbook biological knowledge with the sequences from the sequence databases, to put all the data together and allow people to combine them. For example, on ViralZone you get a taxonomic index, and the bread and butter of ViralZone is the fact sheets. This is a Flavivirus fact sheet. These fact sheets are linked to sequences from UniProt, to some analysis tools, and more. So why have a resource specifically dedicated to viruses? It's because these entities are very diverse, much more diverse than cellular entities, and there's a lot of hidden knowledge about them. So people need some sort of resource to reflect that. These are the shapes of human viruses, drawn to scale on this slide. In the rest of the talk, viruses are not to scale when they are put together. You see that Ebola virus is quite big compared to the others.
You see variola virus, which has disappeared, herpes, rabies; these are respiratory viruses, and measles as well, HIV, influenza, or adenovirus. And you see that some, like hepatitis viruses and poliovirus, are quite small. So between a parvovirus and Ebola, it's extremely different. Their biology is different and their origins are very different. That's why we built this resource.
This is the homepage of ViralZone, on which you have a classification called the Baltimore classification, which is virus-specific, to access the different viruses. You can search as well, and you have a lot of different links. You have molecular process pages, which are a sort of controlled vocabulary that I will describe later on, and the replication cycles. For this talk I will focus on Ebola virus, because it's quite a star these days with what's happening in this latest outbreak. To access Ebola, you could go by several means, but you can just type into the search box, and it's a smart search: it proposes different things. With Ebola virus you will end up with this. It gives you access to fact sheets for the genus, the family, or the whole Baltimore group.
So what is Baltimore? Baltimore is a very specific classification. The reason is that for cellular organisms you have a unique common ancestor, called LUCA, and you can derive all cellular organisms from that one ancestor. For viruses, that's not at all the case. On the tree of life, viruses would be like ivy: several roots going from branch to branch all the time, changing, jumping from one to another. So it's quite complicated, and it's not a dichotomous history like the one we have for cellular organisms. That's why we needed some other kind of classification. Also, they are extremely variable in sequence. Just to show you, this is the mutation rate compared to genome size for higher eukaryotes, lower eukaryotes, and prokaryotes.
And you see that viruses go very, very far in mutation possibilities at each replication cycle, especially RNA viruses. So the classification we have for all of them is called the Baltimore classification, named after the Nobel laureate David Baltimore. Basically it's based on the nature of the nucleic acid inside the virus. You get all sorts of possibilities. For cellular organisms you just have double-stranded DNA. But viruses have all the possibilities: single-stranded DNA, double-stranded RNA, negative-strand or positive-strand RNA, and reverse-transcribing viruses, of which there are two kinds. This slide depicts how these organisms make messenger RNA. Knowing the Baltimore group tells you how the virus has to deal with molecular biology to express messenger RNA and things like that. So viruses in a group have things in common, but they are not necessarily of the same origin.
In ViralZone, these buttons correspond to the Baltimore classification. Here we have the negative-strand RNA viruses, and that's where Ebola virus is. You also get influenza virus, and you get rabies virus here. On these pages I've put a color code. Pink means that it infects humans, yellow is for insects, green for plants, and so on. You see that this kind of virus doesn't infect bacteria, for example, or archaea. It's just eukaryotes, and not even lower eukaryotes, just higher eukaryotes.
When you click, you get to the Ebola fact sheet. You get several things: a picture depicting the virion, the genome, different vocabulary concerning gene expression or replication, links to information on disease, host, and things like that, and access to sequences and reference strains as well. For example, you can click on these little tags to get access to the protein sequences. You have two kinds of listing. One is by strain: here you get the reference strain of Zaire ebolavirus.
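The Baltimore grouping described in this part of the talk can be sketched as a small lookup structure. This is only an illustration, not how ViralZone stores its data: the names `BaltimoreGroup`, `VIRUS_TO_GROUP`, `viruses_in`, and `genome_can_serve_as_mrna` are hypothetical, and the table lists only viruses named in the talk.

```python
from enum import Enum


class BaltimoreGroup(Enum):
    """The seven Baltimore groups, keyed by the nucleic acid in the virion."""
    DSDNA = "double-stranded DNA"
    SSDNA = "single-stranded DNA"
    DSRNA = "double-stranded RNA"
    SSRNA_POS = "positive-strand RNA"
    SSRNA_NEG = "negative-strand RNA"
    SSRNA_RT = "single-stranded RNA, reverse-transcribing"
    DSDNA_RT = "double-stranded DNA, reverse-transcribing"


# Hypothetical lookup table, limited to viruses named in the talk.
VIRUS_TO_GROUP = {
    "Ebola virus": BaltimoreGroup.SSRNA_NEG,
    "Influenza virus": BaltimoreGroup.SSRNA_NEG,
    "Rabies virus": BaltimoreGroup.SSRNA_NEG,
    "HIV": BaltimoreGroup.SSRNA_RT,
}


def viruses_in(group):
    """Return the viruses of one Baltimore group, sorted by name."""
    return sorted(name for name, g in VIRUS_TO_GROUP.items() if g is group)


def genome_can_serve_as_mrna(group):
    """Only a positive-strand RNA genome can be translated directly;
    every other group must first produce messenger RNA."""
    return group is BaltimoreGroup.SSRNA_POS


print(viruses_in(BaltimoreGroup.SSRNA_NEG))
```

The point of the sketch is that the group says something about how the virus reaches messenger RNA, not about shared ancestry, which matches the remark that members of a group are not necessarily of the same origin.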
And you get all nine genes of the virus. Or you can just look at the polymerase, and you get all the polymerases curated in Swiss-Prot, or the nucleoprotein. And you get some tools: you can align, retrieve, and things like that. So you can view either by protein or by organism. This virus is quite small, just nine open reading frames, so it's easy to see it that way.
We have graphics, which are actually quite popular, depicting the virion, its shape and dimensions, and the genome as well. As I told you, many viruses are very small, so it's quite easy to show the genome. If you did that for humans, it would not be nice to look at, but for something this small it's quite easy. You see that the colors are the same in the virion and in the genome. For example, the nucleoprotein is green, so it's here, and you see where it is in the virion. The glycoprotein is yellow, so it's on the surface, and so on. These pictures are not just made to be beautiful; they are meant to reflect the real shape of the virus. I use cryo-EM pictures as much as possible for the images, and only the nucleic acid is not to scale, because it would be too tiny to be seen.
We have a whole range of virion pictures, just to show you the diversity. For example, you have some plant viruses that have several capsids because they have a segmented genome, so each segment is in a different capsid. You have rotaviruses, which infect your gastrointestinal tract. You see that there are three layers of capsid, which makes them quite resistant to your digestive system, because there are three layers to get through before you can inactivate them. You have M13, much used in the laboratory, which is a completely filamentous virus. And many others. Again, some others. This is an archaeal virion, quite special; we don't know much about them. This is papillomavirus, the ones making warts on your skin. The herpesviruses.
And these are the common bacterial viruses, the Caudovirales; there are three kinds of them. The baculoviruses, which are used in laboratories as well. And this is Mimivirus. I told you it's not to scale, because Mimivirus would be very big otherwise.
Besides this graphical information, we have more interesting information in the form of controlled vocabulary, covering everything that is specific to virus biology. We consider that cellular biology, the common biology such as ribosomal translation, is known, and we focus on things that are specific to the virus, which someone who knows only common cellular biology would not know. So in this text, under gene expression, there are links to controlled vocabulary pages on which you get pictures depicting what's happening, some text, links to publications, and sometimes a listing of all the different viruses doing the same thing. For example, this is influenza. You can see how it creates the poly(A) in a special viral way, and how it steals the cap of messenger RNAs. Cells are able to defend themselves by detecting that a cap is not of cellular origin, so influenza steals cellular caps. It's quite fine.
I'll browse through the different kinds of vocabulary we have. We have vocabulary specific to entry: all the different ways a virus can use to enter its host cells. Basically, they enter by triggering what we call an eat-me signal. Actually, they are not really able to enter a cell actively, because they don't have the enzymes or anything like that. So they manage to bind something on the cell that triggers an eat-me signal, and the cell then does the job of eating the virus. They do nothing themselves. They just have to cross the membrane after all that; it's the only function they have. Then there is the controlled vocabulary corresponding to replication, transcription, and translation.
You see, cellular processes mainly use bidirectional replication, whereas viruses, because of the nature of their genomes, use a wide variety of replication mechanisms, as well as transcription and translation mechanisms that really fool the ribosome in many different ways. They are very small, so they make the most of their coding sequence by fooling the ribosome: frameshifting, going backward, editing, whatever you can imagine. So they are able to get many proteins from one gene with special tricks.
Then the exit pathways. The worst one would be lysis. That's where the virus actually hurts the host, because, like influenza, it's often lysing your cells, and that's what makes you sick. Some use exocytosis, which is simpler, and others move from cell to cell, so they don't actually get out; that's mainly in fungi and plants.
So how do we really link that knowledge to the sequence data? Because the viral fact sheets are a human view of knowledge. We use controlled vocabulary. For that, we use three kinds of controlled vocabulary that are linked together. There are the viral fact sheets that I showed you, which are a kind of contextualization, with the ViralZone vocabulary. This vocabulary is completely linked to UniProt keywords. So we have the same concepts in UniProt keywords and in GO terms, and all of those are linked together. It means we have about 200 concepts that are linked through ViralZone. If you find a concept in GO, you can go back to ViralZone and then to UniProt, or wherever, because they are linked together. The way we curate that is by reading the literature; it's manual curation. We use the UniProt editor to put the keywords in UniProt.
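The three linked vocabularies can be pictured as one record per concept, reachable from any of its identifiers. The sketch below is a minimal illustration under stated assumptions: the `Concept` class, `build_index`, and `cross_reference` are hypothetical names, every identifier is a made-up placeholder and not a real ViralZone page, UniProt keyword, or GO term, and allowing a missing identifier is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    """One virus-biology concept with its identifier in each vocabulary."""
    name: str
    viralzone_page: str
    uniprot_keyword: Optional[str]
    go_term: Optional[str]


# Placeholder identifiers for illustration only; none of them is real.
CONCEPTS = [
    Concept("cap snatching", "VZ-PAGE-1", "KW-EXAMPLE-1", "GO:EXAMPLE-1"),
    Concept("some very detailed mechanism", "VZ-PAGE-2", "KW-EXAMPLE-2", None),
]


def build_index(concepts):
    """Map every known identifier to its concept, so a lookup can start
    from ViralZone, from a UniProt keyword, or from a GO term."""
    index = {}
    for concept in concepts:
        for ident in (concept.viralzone_page,
                      concept.uniprot_keyword,
                      concept.go_term):
            if ident is not None:
                index[ident] = concept
    return index


def cross_reference(index, ident):
    """Return the identifiers of the same concept in all three vocabularies."""
    concept = index[ident]
    return {
        "viralzone": concept.viralzone_page,
        "uniprot": concept.uniprot_keyword,
        "go": concept.go_term,
    }


index = build_index(CONCEPTS)
print(cross_reference(index, "GO:EXAMPLE-1"))
```

Starting from the GO placeholder, the lookup returns the matching ViralZone page and UniProt keyword, which is the round trip described in the talk.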
The GO annotation is either made manually or, if we put in a keyword that is linked to GO, the GO term will be there automatically, but it will be an automatic link, so there will be no publication associated with it. And of course we make ViralZone manually. So we do it all together. This is an example of the controlled vocabulary that is linked. Under the boxes, it's not very visible, but you have the UniProt keyword, the GO term, and the ViralZone page ID. For many of them you get all three. For some of them you don't have, for example, a GO term, because it's too detailed and it's not the purpose of GO to go that far, or for other reasons, but mostly they all get the three addresses.
The way it works is that in ViralZone you get an interaction menu. For example, you can get details, and here you have links to all the UniProt entries related to that concept through the keyword, or to all the entries related to the GO term. They are actually the same list of proteins, in quite different views, because they are linked together, and you can cross from GO to ViralZone to UniProt as well. We have a sidebar as well, which gives a lot of small pieces of information, like links to virus-specific pages or resources, taxonomy, the reference strains, which I will describe a bit more in a moment, and some information on host and disease.
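The difference between a manual GO annotation and one that follows automatically from a keyword can be sketched as below. This is a hedged illustration, not the UniProt pipeline: `GoAnnotation`, `KEYWORD_TO_GO`, and `go_annotations` are hypothetical names, and all identifiers and the publication string are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoAnnotation:
    go_term: str
    source: str                 # "manual" or "keyword-mapping"
    publication: Optional[str]  # None for automatic annotations


# Placeholder mapping; a keyword may have no GO counterpart (None).
KEYWORD_TO_GO = {
    "KW-EXAMPLE-1": "GO:EXAMPLE-1",
    "KW-EXAMPLE-2": None,
}


def go_annotations(keywords, manual):
    """Combine manual GO annotations with those implied by keywords.

    `manual` maps a GO term to the publication supporting it. A keyword
    adds its linked GO term automatically, without a publication, unless
    a curator has already annotated that term by hand.
    """
    annotations = [GoAnnotation(term, "manual", pub)
                   for term, pub in manual.items()]
    seen = set(manual)
    for keyword in keywords:
        term = KEYWORD_TO_GO.get(keyword)
        if term is not None and term not in seen:
            annotations.append(GoAnnotation(term, "keyword-mapping", None))
            seen.add(term)
    return annotations


for annotation in go_annotations(["KW-EXAMPLE-1", "KW-EXAMPLE-2"], {}):
    print(annotation)
```

Letting the manual annotation win over the automatic one is a design choice of this sketch, made so that the version with a publication attached is the one kept.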
Reference strains are now called reference proteomes. They exist because we have a lot of viral entries and a lot of complete genomes in the database; viruses are small, so it's easy to get that. For example, in influenza we have something like 40,000 or 50,000 entries coming in. So the question for a user is: if I want to look at one annotation, which one should I take? That is the purpose of the reference proteome. It is a gold standard: we put in all the data, the GO terms, the keywords, and the papers. We try to put everything in that sequence so users can get access to it. Otherwise it's split between many, many viruses. For example, hepatitis C: we are unable to grow hepatitis C in cell culture, so we just have samples taken from human cases. It means that every laboratory in the world has a different hepatitis, sorry, hepatitis B. If you attach each paper to its own strain, you get hundreds of strains, each one showing a little bit of information, and for a user to get it back would mean going through all of them. So we put everything in a reference strain. The fact that something was not shown in this strain is indicated in a way, but in one place you get most of the knowledge about hepatitis B.
It's still a true entry? Yes. No, no, yes, it's a true entry in which we put all the knowledge, and we still trace it back to the original strain, of course, meaning that we think the function is shared; otherwise we are not going to do that. For example, in influenza you get 400,000 entries right now. Some of them are reviewed, and the gold standard is only the 13 reference strain entries of Puerto Rico. The fun thing about influenza is that, you see, it's quite small: there are only 8 segments in influenza, encoding 12 to 14 proteins. So you see that the ribosome and alternative splicing do a lot of interplay to make many different proteins from one gene. And the fun thing is that influenza has been studied at the molecular level for more than 30 years, it's only 8 segments, and we keep discovering new open reading frames. I
mean, among the 12 to 14, you have 4 proteins that have been discovered in the last 3 years. So good luck with the human genome, because this one is very small and we keep finding new things, most of which tend to be important.
At the end of the day we have 401 reference proteomes (we didn't manage to make it 400) that are supposed to represent all known viruses. When I say known viruses, I mean the ones that have been classified and studied; of course that's just a part of the virosphere, which seems to be quite big. It comprises 16 proteins, so it reflects most of the diversity of known virology today, whereas we have more than 2 million open reading frame sequences in the databases. This is what we call the virosphere; it's also the Baltimore classification, with small graphics of the different viruses.
We have replication cycles for a few viruses; it takes a lot of time to make them. For example, this is the one for Ebola. Again, it's linked to the controlled vocabulary, and we get some pop-ups and things like that. Ebola is quite simple. It triggers the eat-me signal through macropinocytosis, because it's quite big, so it's not going to be able to enter an endosome or things like that; the cell actually captures it and puts it inside. Then the virus performs fusion, which is the only activity it has and is quite standard in various enveloped viruses. Then it can start its replication directly in the cytoplasm, because it's all RNA; it's RNA business. So it can go in most cells, and it works quite fine: all cells make RNA, pretty much, so they have as many ribonucleotides as needed. So they just replicate, they block a few antiviral systems, interferon and the RNA sensors, and then they are able to get out and to finish off the cell at the end.
This is very important, the way the virus interacts with the host, because it really determines the success in the host, in the cooperation between the two organisms, in the interaction. And actually modern cells
have very heavy antiviral defenses, and for a virus to get into human cells, for example, it needs a handful of keys to take down all these antiviral systems, to enter specifically, and things like that. So it's an arms race between the cell and the viruses. For example, for influenza virus, you know that humans are not really susceptible to avian influenza, for several reasons. We have differences in the receptors in mammals compared to birds, and in humans, a few million years ago, we lost a gene for making a neuraminic acid. It was lost just in humans and not in other mammals, and it seems to prevent most avian influenza from entering. The ferrets have lost the same gene, so that's why ferrets are a good model for influenza that is specific to humans. So you see, we lost this gene, which means that we are much less susceptible to zoonotic influenza, which is quite good for the species. But the virus managed to create a special human strain. It's still a bit better, I think, because you are susceptible to the human strain but not to all the circulating strains. But you cannot really silence the viruses; they manage to evolve very quickly. So there are a lot of defenses, and the virus has a lot of counter-defenses.
Just to show the difference, this is the HIV replication cycle, which is actually much more studied than Ebola's. To study Ebola you have to work in a P4, and it's quite difficult to do that because these are very dangerous viruses, whereas for HIV you can work in a P3, and millions of dollars go into HIV, so we have much more knowledge. But you see it's quite different. This one has to enter the nucleus of the cell, so that's why not all cells are susceptible: you need to cross the nuclear pore, and in some cells the virus is not allowed to do that. So it has a lot of countermeasures to be able to do this, and then it has to get out as well. So things are quite different for HIV, and it's
dedicated to T lymphocytes, so it has a lot of keys to enter T lymphocytes but not other tissues.
So we arrive at the end. In ViralZone, besides all of this system, we have also developed some e-learning. We have some on phylogenetics, BLAST, and multiple alignment; for example, for BLAST and multiple alignment we have 100 pages. These courses have been created in collaboration with FAO and IAEA and are focused on animal viruses.
Finally, the statistics of ViralZone. We have 407 virus fact sheet pages and about 100 viral families, more than 2,000 static web pages, because you have 4 static web pages for each genus, plus all the controlled vocabulary and things like that, and a lot of unique pictures. The number of visits is increasing every year. We still have not hit the roof; it will happen some day, but we are at 26,000 visits per month and a lot of page views, from many countries. But it's mainly from the US, like 37%, and the second is around 6%, so a lot of virologists must be in the US, I don't know.
So thank you for your attention. This is the ViralZone team. Thanks to the curation team. Edouard de Castro does all the programming, Lydie Bougueleret is the Swiss-Prot operations director, and he is the group leader of Swiss-Prot.