Good morning, everyone. My name is Babika, and today I'm going to talk a bit about the COVID-19 Viral Beacon, which we have developed in our group at the European Genome-phenome Archive (EGA) in Barcelona. I would like to thank the organizers, the Galaxy team, for giving us this opportunity to tell you about Viral Beacon. My talk fits with the visualization part of the SARS-CoV-2 data analysis and monitoring system.
The COVID-19 Viral Beacon was developed to share the genomic variability of the virus, that is, the mutations or variations at each position. It was important to know where the virus is mutating and at what frequency. The idea was to develop a platform for quick searches that you can do anywhere. Given the urgency of the situation, we wanted a platform that you can quickly look at from your mobile phone, for example. So the primary motivation was to make a very simple system where you can query nucleotide variants or amino acid variants, along with some associated metadata: for example, in which countries a variant has been found most often, and what the unique cases are. We also wanted to include all sorts of data, Illumina and Nanopore, because most of the data we were seeing was consensus data, and we wanted to look with more granularity to get an idea of intra-host variation, for example.
So this is the COVID-19 monitoring plan that we are working on in collaboration with Galaxy and ENA. We get data access from ENA; the Illumina data is then processed by the Galaxy team, and we process the Nanopore data ourselves in collaboration with the Biocore team at CRG.
Once analyzed, this data is deposited on our FTP server, from which the Viral Beacon platform fetches it for visualization. You can also download the data directly from the FTP server and later do all sorts of analyses, which I'm sure will be covered later in this conference. So this is the basic collaboration that we are working on.
But before going into Viral Beacon, I want to talk about the human Beacon, on which the idea was based. So what is the human Beacon? Beacon is a GA4GH initiative for the federated discovery of genomic data in biomedical research and clinical applications. The project was born to maintain the privacy of human genomics data while still sharing it, keeping all the privacy and ethical concerns in mind. The basic idea was that we don't want to know anything about the metadata or the identity of the patient; you just tell us whether you see a variation or mutation at a given position. For example, at position X, do you see a C or a T? The simple Beacon design was to return just a yes or no response. This was one way to search quickly for variations in human genomics data without jeopardizing the privacy of the data. That was the design of Beacon version 1, and right now we are developing version 2.
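The yes/no design described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the GA4GH Beacon API: the `Variant` record, the `beacon_exists` function, and the two dataset entries are hypothetical names and values made up for the example.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A single-nucleotide variant (hypothetical record, for illustration only)."""
    chrom: str
    position: int
    ref: str
    alt: str


# Hypothetical private dataset held by one institution (made-up entries).
# Nothing about patients or metadata is exposed through the query below.
DATASET = {
    Variant("1", 12345, "C", "T"),
    Variant("7", 55555, "G", "A"),
}


def beacon_exists(chrom: str, position: int, ref: str, alt: str) -> bool:
    """Answer only whether the variant has been observed: yes or no."""
    return Variant(chrom, position, ref, alt) in DATASET


if __name__ == "__main__":
    print(beacon_exists("1", 12345, "C", "T"))  # variant is in the dataset
    print(beacon_exists("1", 12345, "C", "G"))  # variant is not in the dataset
```

The point of the sketch is that the caller learns a single boolean and nothing else about the underlying records.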
EGA is leading Beacon version 2, and what we are trying to do here is make it a bit wider than before. In the previous version you could only ask about a certain mutation and you would get a yes or no response. With version 2 we are trying to provide much more information. For example, given the mutation that you see at that position: could you tell me more about it, such as in which diseases this mutation has been seen, which group is holding this kind of data, how I can apply for permission to access the dataset, or what other metadata I can access about this mutation?
The way Beacon works is basically this. You have your genomics dataset; you can be an institute, a clinical repository, or a private entity, it doesn't matter. You put a Beacon on top of it, irrespective of how you store the data. We call this "beaconizing" the entity, in this case hospitals. The Beacon takes the queries that a user asks and generates the responses; the Beacon Network then aggregates the responses from the different entities and gives you back the answer to the question you asked. Beacon version 2 can handle different query types: variant annotations, ClinVar IDs, and so on; some information about the individuals, such as sex, age, or the diseases in which the variant has been seen; more about biosamples; and so on and so forth.
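The aggregation step, where one question is sent to several beaconized entities and the answers are collected, can be sketched as follows. This is a toy model under stated assumptions, not the Beacon Network implementation: `make_beacon`, `query_network`, the entity names, and the observed variants are all hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

# A beacon is modeled here as a function answering yes/no for (position, alt).
Beacon = Callable[[int, str], bool]


def make_beacon(observed: Set[Tuple[int, str]]) -> Beacon:
    """Wrap a private set of observed variants behind a yes/no interface."""
    def answer(position: int, alt: str) -> bool:
        return (position, alt) in observed
    return answer


def query_network(beacons: Dict[str, Beacon], position: int, alt: str) -> Dict[str, bool]:
    """Send the same query to every beacon and collect the responses by name."""
    return {name: beacon(position, alt) for name, beacon in beacons.items()}


# Hypothetical network of three beaconized entities with made-up data.
NETWORK = {
    "hospital_a": make_beacon({(12345, "T"), (222, "G")}),
    "institute_b": make_beacon({(12345, "T")}),
    "repository_c": make_beacon(set()),
}

if __name__ == "__main__":
    for name, found in query_network(NETWORK, 12345, "T").items():
        print(name, "yes" if found else "no")
```

Each entity keeps its own storage; only the yes/no answers leave it, which is what lets the network aggregate across institutions.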
So that was the idea that we were already working on, and then Viral Beacon came into the picture. I want to tell a little bit of the story here, because this was the scenario in which we worked during the lockdown. Our institute is located next to the hospital, and you can see in the second picture the makeshift, temporary tents that were put up in front of our office as an extension of the hospital. So this was the scenario, and we thought: let's see what we can do with SARS-CoV-2. It started with very basic planning. Despite not having any expertise in virus genomics, we tried to set up a program in which a team like Galaxy could collaborate and share their data and we could host something for visualization. On our side we also started analyzing some data and setting up collaborations, so many good things happened during this time. What we have ended up with is a platform for data visualization, but also for sharing the data through our FTP server, which I will talk about later; you can just go and download it and play with the data for your own curiosity, or, if you are a viral genomics expert, you can contact us and we will help you in whichever way we can.
The basic schema of Viral Beacon is as follows. The data is downloaded from ENA, whatever is available for SARS-CoV-2. We then filter it by human host and apply other filters as well. The metadata is sent to the metadata filtering and harmonization team, while the raw genomic data is divided into Illumina data, which Galaxy is analyzing; Oxford Nanopore data, which the Biocore team at CRG is analyzing in collaboration with us; and consensus data, for which we use the MicroGMT pipeline.
The data is later annotated and converted into a Beacon-friendly format. We then apply Beacon on top of this dataset, and it is imported into the Viral Beacon platform for visualization. The data is also imported into JBrowse for visualization of the BAM files, etc., and placed on the FTP server for download.
The selection criteria that we apply right now are these. The organism has to be SARS-CoV-2, because there are many different SARS-CoV variations in the database. The host taxon is human, because experiments have also been done on mouse models, etc. The target is whole-genome sequencing. The sample source is natural or clinical in origin. And the sequencing technology is restricted to Illumina or Oxford Nanopore.
I also wanted to talk a bit about data harmonization, which is a very lengthy and tiresome process. From the outside it looks trivial, but it was not, because this data usually comes as free text. You can see, for example, that COVID-19 gets written in different ways. There is also COVID-22 or COVID-23. So with free text we have to look at the variations of the text, and if it is clear to us what the submitter intended to write, we convert all of them into one tag; in this case it will be COVID-19. Likewise for swabs: nasopharyngeal, for example, is the same thing written in different ways. What we try to do is map each value to the nearest tag. And the story doesn't end here. We see the same for dates, country, host sex, host disease, et cetera, and we try to harmonize those as well.
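The harmonization idea, mapping many free-text spellings to one tag and dropping values whose intent is unclear, can be sketched like this. It is a minimal illustration and not the team's actual pipeline: the `SYNONYMS` table and the `harmonize` function are hypothetical, and the spellings listed are examples chosen for the sketch.

```python
from typing import Optional

# Hypothetical synonym table: each free-text spelling maps to one agreed tag.
SYNONYMS = {
    "covid-19": "COVID-19",
    "covid19": "COVID-19",
    "covid 19": "COVID-19",
    "coronavirus disease 2019": "COVID-19",
    "nasopharyngeal swab": "nasopharyngeal swab",
    "nasopharyngeal": "nasopharyngeal swab",
    "np swab": "nasopharyngeal swab",
}


def harmonize(value: str) -> Optional[str]:
    """Return the agreed tag for a free-text value, or None if it is unclear.

    The value is lowercased and runs of whitespace are collapsed before the
    lookup, so trivial spelling differences do not need their own entries.
    """
    key = " ".join(value.strip().lower().split())
    return SYNONYMS.get(key)


if __name__ == "__main__":
    for raw in ["  COVID19 ", "NP  Swab", "unknown illness"]:
        print(repr(raw), "->", harmonize(raw))
```

A lookup table keeps every decision explicit and reviewable; anything not in the table comes back as `None` and can be removed rather than guessed.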
And if it is not clear to us, we remove it from the sample. The data available in Viral Beacon right now numbers more than 100,000, and we are working on more. This is the data that you will see for visualization on Viral Beacon; our FTP site holds much more than that.
Here is a quick summary of what you are going to find on the web page. There is the variant search, where you can enter the variant you are looking for. You can play with the information tags, and there are different ways of entering a query: you can search for all variants at a position, or for specific variants, for example "show me at this position only those that are C changing to T". It gives you some information about the metadata, where the data is coming from, and what the frequency was on the different sequencing platforms. Here you see the frequency in Illumina, in ONT, and the frequency of the same variant in consensus data, so you can see how it varies across datasets. These data have been normalized per platform. There is other information as well.
There are different ways to query: SNP queries, region queries, feature queries, motif queries, and amino acid queries that you can play around with. One example of a region query is that you can see the different frequencies of the variants at a given position or over a range of positions. Likewise we have the motif search: if you want to see short repetitive k-mers, where they have appeared, and whether and how they have been mutated, the motif query is one way of finding that out. You can also find a summary of the metadata; because the amount of data keeps growing, we put in some plots to make it compact for visualization and to allow quick analysis of the metadata.
Let me now go quickly to the web browser for a quick demo session. This is the home page of Viral Beacon, with some information and a very quick visualization of the most mutated positions. You see here the highly mutated positions, where different colors show different data platforms, and some on-the-fly statistics on the data, just to give a summary of where most of the data is coming from. For example, here we see the UK: much of the data that we see in Viral Beacon has come from the COG-UK project, which is quite a big project, and the data was publicly shared, so we have a lot more data from there than from other countries.
Then we have the variant search page, which, as I explained before, lets you search for the variants that you are interested in. There are different ways to make a query. It can be position-specific and variant-specific, but you can also look for a certain SNP or for all sorts of variants at a position, insertions or deletions as well. So please feel free to play. You can also apply many other filters to restrict your searches: for example, sample source, sample type, platform, host sex, host age or age range, et cetera.
Another one is the multi-variant query. This was developed because we started seeing that the Alpha and Delta variants, and others, started appearing as sets of co-occurring variants. So if somebody wants to say "show me the samples in which this, this, and this variant occur together", they can make that query here. There is an option, for example, for the most common variant combinations that we found in the literature.
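The logic of a multi-variant (co-occurrence) query can be sketched as a set-inclusion test: a sample matches when it carries every requested variant. This is an illustrative sketch only, not how the Viral Beacon backend is implemented; the `SAMPLES` table, its sample IDs, and the variant assignments are made up, and `samples_with_all` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical sample table: sample ID -> set of variants called in that sample.
# The IDs and the variant assignments are invented for the example.
SAMPLES: Dict[str, Set[str]] = {
    "S1": {"A23403G", "C14408T", "C3037T"},
    "S2": {"A23403G", "C14408T"},
    "S3": {"A23403G"},
}


def samples_with_all(variants: Iterable[str]) -> List[str]:
    """Return the sorted IDs of samples that carry every requested variant."""
    wanted = set(variants)
    return sorted(sid for sid, called in SAMPLES.items() if wanted <= called)


if __name__ == "__main__":
    print(samples_with_all(["A23403G", "C14408T"]))
    print(samples_with_all(["C3037T", "C14408T"]))
```

Grouping the matching samples by country or date afterwards would give the kind of per-country summary that the demo describes.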
So you can make a query and see how many samples there are and in which countries we see these lineages. That was the multi-variant query.
I already explained the region query. Again, you enter the range of positions that you want to see, and from there you can get some information and frequency plots. All of these have frequency tracking plots, so they take a bit of time to load, but have some patience and you can see over time how the frequency of certain variants has developed, whether they are going up or down, et cetera.
Likewise there is the motif search. You enter a k-mer that you are interested in; in the background it uses FIMO to search for the k-mer. You can then click on the motif you are interested in, see at which positions it occurs, and see how it has changed, if there was a change.
What else? Yes, we also have a JBrowse instance. JBrowse is a genome browser that you can use to look deeper into the data, directly into the reads and the BAM files, to be precise about your findings. It has been arranged by country to make things easier. So that was a quick introduction; please feel free to browse Viral Beacon for more data visualization.
Okay. So, back to the data monitoring plan that I was describing before. Right now the COVID-19 monitoring plan holds more than 700,000 files, around 15 TB of data, on our FTP server, free to download and open for exploration and research purposes. And this is the site for the data download. The projection for the future of the Viral Beacon project is that there is a lot more data in public repositories that has not been analyzed yet, but it will eventually end up on this FTP server.
This is in progress, so we are continuously working with the Biocore team on the Nanopore data analysis and with the Galaxy team. And it is not only what is on public servers: Galaxy is also analyzing country-specific data that Viral Beacon is receiving. So in the future there is a lot more data yet to come.
So what can we do with this data? We are holding it for research purposes, but how can it be used? Of course, we are not viral experts here; this is something that we have read about, and we encourage others, especially viral genomics experts, to use this dataset for good research purposes. For example, one very common question one could ask is to show the low-frequency intra-host variations that are consistent across strains. Low-frequency variants are usually not detected in GISAID or in other databases; these variants will not even appear there, and unless they are present in at least 10% of the samples they are usually missed. So this is one good way to ask about the low-frequency variants and see how they are making their way up: for the Delta variant, for example, how it started. You can trace that kind of frequency, that kind of story.
Another use is the intra-host variability that I mentioned before. With consensus variants, only those that cross a certain threshold are recorded, and that is what you see in most publicly available data platforms. But what about the variants that are at very low frequency but are nevertheless coming up? This is one plot that shows exactly that. This is the Illumina data for the same variant.
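The distinction between consensus variants and low-frequency intra-host variants comes down to the allele frequency within a sample, that is, the fraction of reads at a position that support the variant. The sketch below illustrates this under stated assumptions: the threshold values (0.5 for consensus, 0.05 as a minimum reporting frequency) are placeholders chosen for the example and are not the thresholds used by Viral Beacon or any particular pipeline, and `allele_frequency` and `classify` are hypothetical helper names.

```python
def allele_frequency(alt_reads: int, depth: int) -> float:
    """Fraction of reads at a position that support the variant allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def classify(alt_reads: int, depth: int,
             consensus_threshold: float = 0.5,   # assumed value, for illustration
             min_frequency: float = 0.05) -> str:  # assumed value, for illustration
    """Label a variant by its within-sample frequency.

    'consensus'       : would appear in a consensus sequence
    'low-frequency'   : visible in read-level data but absent from the consensus
    'below-detection' : too rare to report under the assumed minimum frequency
    """
    af = allele_frequency(alt_reads, depth)
    if af >= consensus_threshold:
        return "consensus"
    if af >= min_frequency:
        return "low-frequency"
    return "below-detection"


if __name__ == "__main__":
    for alt, depth in [(900, 1000), (80, 1000), (10, 1000)]:
        print(alt, depth, classify(alt, depth))
```

Read-level data such as Illumina reads makes the middle category visible, which is why a variant can be tracked there before it ever shows up in consensus sequences.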
You can see how it started showing up in the Illumina data, while in the consensus data it was not appearing on any of the platforms until around October. This is the farmed mink variant, Y453F, which started appearing in humans. It was already appearing long before we started observing it in the consensus dataset: we started observing it there around October, but we were already seeing this variant in the Illumina dataset.
Likewise, there are other analyses, for example synonymous versus non-synonymous mutations. This plot shows that most of the low-frequency intra-host variations that we are observing were actually non-synonymous mutations, so somebody might want to look into that.
Then there is real-time tracing of variant evolution. This is the variant evolution plot that Viral Beacon also provides, which you see at the bottom of your screen. It traces how a variant has been evolving since the data arrived at Viral Beacon.
Okay, so that will be all from my side. For more information on Viral Beacon, please visit the pipeline page on the website to see what kinds of tools and pipelines we are using. These are the usage statistics for Viral Beacon. They are old statistics that need to be updated, but we see between 500 and 600 monthly users accessing our servers.
I want to say a big thank you to the team, the collaborators, and my colleagues who were part of this platform's development. Thank you to the Galaxy team for being persistent and collaborating with us. I think we felt motivated because this is something that we have developed that is becoming useful for the community. Thank you to the Galaxy team.
Thank you to the Biocore team at CRG for helping us with the data analysis. If you have any queries about Viral Beacon, any suggestions, or any problems with the dataset, please reach out to me or my team; we will be happy to hear from you. Thank you. My name is Babika, and it was a pleasure to give a talk about Viral Beacon. Thank you.