Okay, so welcome all to this last SIB virtual computational biology seminar of the series. Today we have the pleasure to have Ioannis Xenarios from the Vital-IT and Swiss-Prot groups, two competence centres of the SIB, the Swiss Institute of Bioinformatics. Ioannis got his PhD in immunology at the Ludwig Institute for Cancer Research and the Institute of Biochemistry at the University of Lausanne. After working on retrovirus-specific B-cells in mouse mammary tumour virus infection, he joined the lab of David Eisenberg at the University of California, Los Angeles for his postdoctoral training, where he worked on the development and implementation of the Database of Interacting Proteins. In 2002, Ioannis came back to Europe, and in the following years he led several groups at the Merck Serono Pharmaceutical Research Institute, centred on computational methodology development in the fields of proteomics, genomics and genetics. Since 2007 he has led the Vital-IT group, a multidisciplinary team of scientists and technical staff maintaining a competence centre in bioinformatics and computational biology. The core facility enables scientists to access state-of-the-art computational infrastructure, as well as expertise in data analysis and algorithm development, and Vital-IT also provides a web portal, ExPASy, offering access to a range of databases, software tools and services. Since 2009, Ioannis is also the director of Swiss-Prot, the SIB group which develops, annotates and maintains UniProtKB/Swiss-Prot, the most widely used protein information resource in the world. The group also develops and maintains other widely used resources, such as PROSITE, ENZYME, HAMAP, Rhea and SwissLipids, and it co-leads the development and maintenance of the ExPASy proteomics website. Finally, Ioannis Xenarios is also a full professor at the University of Lausanne. Today he will share with us his vision on the evolution of these two competence centres, Vital-IT and Swiss-Prot, and I hope you have enough time to be here for a few hours. So Ioannis, thank you for accepting to give a talk in this series. The floor is yours.

Thank you. It's true, because I have 156 slides. Fidel Castro died a few weeks ago; he used to give talks of four hours. I will try to stick to the one hour, but this is actually a really hard and difficult challenge. As you've seen, there are two groups whose mission is to provide service and support to research in the Lemanic area, but also beyond, in the eastern part of Switzerland, in the European framework, as well as worldwide. And it's extremely hard to summarize all the activities that go on between these two groups. So I chose to give you a bit of a snapshot, and I'll hop from one story to another, which is definitely not what my former professor David Eisenberg advised; he told me you should stick to one subject and never cover more than one subject in a single talk. So I will definitely not follow his advice. But I think it would actually be kind of bad to only talk about one activity, when you should see, by the end of the talk, that these two centres create a unique resource and a unique set of people that enable researchers here in Europe and worldwide to do their research. And I talk about evolution because, honestly, some 10 years ago when I inherited Vital-IT, there were only 10 people.
We are now roughly around 80 people in the Vital-IT competence centre, from different walks of life and from different domains. But we have also witnessed over the years a real change in the way we do biology. From the early days, and this is actually a slide borrowed from Ron Appel, from when he gives a talk explaining what bioinformatics is: bioinformatics is the art of praying in front of a computer to transform data into knowledge. Fundamentally, that's what we've been trying to do since the inception of the SIB, since the inception of bioinformatics and since the inception of computational biology.

We are also seeing a real transformation in the way science and technology are evolving. If you follow Twitter and the blogosphere, you see ads about great technologies such as IBM Watson, the system that won at Jeopardy!, which is now applied more and more to the life sciences and medical sciences. And artificial intelligence, which for the old guys like Ron and others really means neural networks, support vector machines and things that have existed for the last 30 or 40 years, is now getting a lot of energy again. This is the kind of technology people wish would provide a solution. We are really in between, because we don't sell technologies and we are not a computer science department; we enable some of the computer science applications, we develop our own, but this is the type of environment in which we are living.

Now, in terms of competence centres, some years ago we went into a kind of very risky business. The risky business was to bring together two separate groups: one dealing with the Swiss-Prot knowledge base, which is now part of UniProt (and I'll talk about this in the first part of the talk), and the other being the Vital-IT centre, which was a very small centre that started in 2004. We are 12 years down the line; Vital-IT has grown up and is now bigger than Swiss-Prot. The thing which is also important is that we know that most of the time people are very specialized in their own domain. By bringing these two groups together, we glued together these two separate activities: the knowledge part with the infrastructure and the algorithmic expertise. And we bring in research groups that are interested in using part of this expertise; I'll demonstrate a few examples. There are also a lot of activities that go into training. Most of the analyses that the new bioinformatics generation is doing nowadays deal with more than just one sample or one gene; they handle terabytes of data. This is something you can't just do on your laptop, however big your laptop is; you have to do it with supercomputing infrastructure designed for this. And that's where the Vital-IT infrastructure is critical.

Now, let me first go to the Swiss-Prot activity. This is an activity led by these four people: the operations director, the head of curation, the head of automation and the head of development. You can imagine that I would not be able to run this kind of endeavour alone; it means maintaining a knowledge base that is used by more than 900,000 people every month. This is not the work of a single PhD student or postdoc. And I am very appreciative of the work and effort, both of the heads of department and of all the people who work in Swiss-Prot.
Curation is the way to transform, synthesize and structure hidden knowledge into an ontology of vocabularies that allows us to search it back. You all use Google, and you ask Google things. At one stage there was Ask Jeeves, where you could ask a question and it returned an answer, for the older people among us. This is in part due to the data-mining technologies that Google has. But the problem is that for a computer, knowing what an evidence or a tag or a word is, versus what the meaning of the word is, remains hard despite the advances in artificial intelligence. So being able to structure this, and having people whose dedication and work is to maintain this, is critical. Because most of the way we publish currently is through peer-reviewed journals. You write an article, we tweet about it, we publish it, it goes on the CV, it increases our h-index, and we go on with our stuff. But the publication is then reused by other people, and giving access to it, not just finding the paper but understanding what it contains, is the work of the curator.

Now, the Swiss-Prot group created the Swiss-Prot database some 30 years ago; this year we had the 30th anniversary. But the Swiss-Prot group is more than just the name of a database. It is a group of curators, and that's why I call it an excellence and competence centre. Competence centre means you have in one place a set of expertise that you want to nurture and maintain. And it's an excellence centre because there is no other team worldwide of that size, funded by not-for-profit organizations (there are others that are for-profit), that maintains that level of bio-curation and keeps up such a high level of quality over the years. Of the projects the Swiss-Prot group contributes to, one is UniProt, which I'll talk about in a moment, and there are several others. You will see that basically all of them take advantage of the excellence and the competence that these people have.

Now, the mission of the UniProt resource is to provide the scientific community with a comprehensive, high-quality, freely available and accessible resource of protein sequence and functional information. The UniProt knowledge base contains, in its Swiss-Prot section, about half a million entries. They are curated from about 300,000 references, and they are queried every month by about 900,000 users who come and ask questions of that resource. It is reused by and linked to about 150 other resources; so if you make one error, you propagate it elsewhere. UniProt itself is a resource that is not maintained only by us. It's done by several groups of people, led by Alex Bateman at the EBI, the European Bioinformatics Institute; Cathy Wu in the US, in Delaware, with PIR in Georgetown; and myself at the SIB. The colour coding pretty much indicates where people come from. Maintaining this kind of resource is not done by a small set of people, also because, let's put it this way, about a million people use it on a monthly basis.

What it contains is information about the function of a protein. It contains a wealth of structured knowledge associated with that function, together with the original papers, so that people do not have to skim through thousands of papers. If you look at p53, there are thousands of papers that describe p53.
The other type of thing, and we are in a world where people sequence genomes by the day, is variation: for the last 30 years we have been characterizing mutations, and we had only a handful of mutations in a given disease. There are a lot of variants that are linked to a disease, but very few, about 7,000, for which we have a real functional impact, meaning the function of the protein is disrupted. For the vast majority, we know the link, but we don't know how and we don't know what it does, at least in our resource. The interesting part is to look at where the variations are located along the protein. This is the protein sequence with its different domains, say a kinase domain and an ATP-binding domain, and you can see that some of the mutations lie in domains that have a given function. Now, if you sequence and ask for all the variation that exists, this is what you see: there is a drastic difference between what we know will impact the function and what exists as variation on that protein. So when someone says "I found the function and I have a prediction of its impact", we always have to take that with a grain of salt and come back to what we know. This is what we know; this is what we see.

Now, one thing we have always been told is that there are too many papers being published. This is the curve that shows the number of publications coming out versus the number of papers that are curated. Remember, about 300,000 curated references, while publications run at a million, several million per year. "So this is not sustainable at all," say the people; "you should stop doing curation and resort only to text mining and artificial intelligence." And I think we should, to a certain extent, but we should also balance that, and I'll show you an example of the work we have been doing here. It is true that, in order to go from papers to structured knowledge, we can at most process and insert into the knowledge base about 8,000 articles per year that carry value about protein function. So it's not hundreds of thousands; it's eight thousand.

Now, there is a paper that has just been published on bioRxiv, which you might want to read, that tries to tackle the scalability and sustainability of curation. This is something we have been asked about by our funders, the NIH and several others: how sustainable is curation? It is a rather important question, because through this we could demonstrate one very important thing, represented in this image. Let me walk you through it very quickly. We compared three types of approach. The first is expert curation; that is what your curators at Swiss-Prot do. The second is monitoring the tables of contents of journals. The third is selecting a random sample of papers between two dates and seeing how many papers truly add something, that is, would be selected for curation. Those are the ones that would go into the curatable part, that would go inside Swiss-Prot. You can see, by the thickness of the lines, that the vast majority come from expert curation; with random sampling, or even journal monitoring, we would not have selected them. But what is most important is that a large fraction of papers are out of scope, at least for UniProt and Swiss-Prot curation: many are redundant (and this redundancy the expert curation detects very well); some have insufficient evidence, meaning you would not completely trust the results, which goes into the discussion people have about reproducible science; and a lot of them are review articles, which we don't curate, since we take the original data. So all these categories, from redundancy on down, are out of scope.
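[Editor's note: to make the triage logic just described concrete, here is a minimal sketch. The category names follow the talk; the fields, the routing order and the `Paper` class are invented for illustration and are emphatically not UniProt's actual pipeline, where expert curators make these calls.]

```python
from dataclasses import dataclass
from enum import Enum, auto

class Triage(Enum):
    """Outcomes described in the talk for papers considered for curation."""
    CURATABLE = auto()              # adds original protein-function knowledge
    REDUNDANT = auto()              # findings already in the knowledge base
    INSUFFICIENT_EVIDENCE = auto()  # results not trusted enough to curate
    REVIEW_ARTICLE = auto()         # reviews are skipped; original data only
    OUT_OF_SCOPE = auto()           # not about protein function at all

@dataclass
class Paper:
    about_protein_function: bool
    is_review: bool
    evidence_is_convincing: bool
    already_in_knowledge_base: bool

def triage(p: Paper) -> Triage:
    """Toy routing logic; in reality this judgment is the curator's."""
    if not p.about_protein_function:
        return Triage.OUT_OF_SCOPE
    if p.is_review:
        return Triage.REVIEW_ARTICLE
    if not p.evidence_is_convincing:
        return Triage.INSUFFICIENT_EVIDENCE
    if p.already_in_knowledge_base:
        return Triage.REDUNDANT
    return Triage.CURATABLE

# A review about protein function is still skipped: only original data counts.
print(triage(Paper(True, True, True, False)))  # Triage.REVIEW_ARTICLE
```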
That means you can have a very large number of papers being published while only a small fraction are curatable and truly carry information. But we do curate a large fraction of the papers that are in scope. So we are sustainable in the way we select, and we use our neural network, which is our brain and our knowledge, to select the right papers. We have been optimizing this, and the paper I described was done with the group of Zhiyong Lu at the NCBI and the National Library of Medicine, because we piggyback on the shoulders of a giant, which is the NCBI. We go to PubMed most of the time to find papers, and we piggyback on their technologies to select and identify the best way forward, so that we optimize our process. This is quite important, because fundamentally it means our process is completely sustainable. So please do read the paper; it's open, on bioRxiv, and we'll see when it makes it into a tabloid, hopefully, at some stage. At least it's ready for consumption.

Now, let me go very quickly through other types of resources, like ViralZone. Here the idea is not to create from scratch a new resource that takes care of viruses, but to piggyback on the UniProt resource and create a view for virologists, a knowledge base that structures this information for a different type of audience. Viruses are a kind of well-structured set of beasts, but the more we sequence, the more we discover viruses for which we have no clue where they fit; people doing metagenomics nowadays see a lot of phages and other things that they don't know how to place into these categories. ViralZone is a resource that is useful for structuring that. It is also used in our effort with the Food and Agriculture Organization, which I talked about a few years ago, to set the reference for the FAO activities.

Other types of activities are metabolic networks, lipidomics and metabolomics. Let me now talk about the Rhea activity, because it is a kind of by-product of what we do when we do enzyme annotation in Swiss-Prot: when we annotate a protein and it's an enzyme, we do the chemistry part as well. Ten years ago, Swiss-Prot was not doing chemistry; strictly speaking, it was using the names of chemicals, not chemistry. Over the last eight years we have built up a set of curated biochemical reactions. It's not a lot; it doesn't comprise all known chemical reactions, and most of the time we don't know about them. But it is the only curated record where we have the actual chemistry, which a chemist or a chemoinformatician could take and use to do their QSAR, their metabolic engineering, and all these kinds of things.
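[Editor's note: a hedged illustration of what "having the actual chemistry" means in practice. Rhea identifies reaction participants by ChEBI identifiers; the Python structure and field names below are invented for illustration and are not Rhea's schema, though the ChEBI IDs shown are the standard ones for these compounds.]

```python
# A minimal, illustrative model of a curated biochemical reaction record.
# Names are resolved to ChEBI identifiers, so the chemistry is computable
# rather than being a free-text compound name.
atp_hydrolysis = {
    "equation": "ATP + H2O => ADP + phosphate + H(+)",
    "left":  [("CHEBI:30616", 1),   # ATP
              ("CHEBI:15377", 1)],  # water
    "right": [("CHEBI:456216", 1),  # ADP
              ("CHEBI:43474", 1),   # hydrogenphosphate
              ("CHEBI:15378", 1)],  # H(+)
    "balanced": True,               # mass- and charge-balanced chemistry
}

# Because participants are identifiers, downstream tools (QSAR, metabolic
# engineering, model building) can join this record against other resources.
all_participants = [cid for cid, _ in atp_hydrolysis["left"] + atp_hydrolysis["right"]]
print(all_participants)
```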
This is the type of thing that was used, for example, in a SystemsX.ch project that Marco Pagni led for several years, together with Lydie Bougueleret in Swiss-Prot: a project called MetaNetX, on metabolic-scale models. In that case, these curated reactions were used as the foundation to create genome-scale metabolic models, and having a trusted set of reactions is rather important for that. Now, all these Swiss-Prot activities produce by-products as resources. It's still the same people, but their expertise is brought to fruition in certain types of resources. So the expert curation is a true added value; it's a true competency. When people need expertise, or access to expertise, they come to the Swiss-Prot group.

Now, if I come back to this image, what I want to go through with two examples is how we try to merge Swiss-Prot and the other activities. We were contacted several years ago by EMBO, the European Molecular Biology Organization, a not-for-profit organization of the EMBL; they are a publisher of journals, and they had a bit of an issue when editing articles. One of the problems with articles is that part of the article is the data: those are the figures that each of you, when you prepare your paper, painstakingly piece together to tell a story. Then there is the narrative, which is what Google and PubMed can find. But the figures and the content of the figures are really hard to identify. So with them, what we have been doing is to collect and characterize genes, small molecules, proteins and cellular components, so that we could do search; and that's where Swiss-Prot was rather critical, because we avoided a lot of pitfalls in the way we designed the system.

If I take the example of that figure, you see three components. There is a component that is measured or injected, like the insulin, for which we have an identifier; the gene, which is used for a transgenic mouse; and some kind of measure, which in this case is the glucose. You can see that here we put the identifiers behind, so that the computer can unambiguously (or as unambiguously as we can manage) identify who is who. What is important is that you can do this for every single figure and tag each of these elements, to finally get a view, from that figure, of which experimental variables were used to intervene on the system, which components were assayed, and which components were used for normalization; in this case, for example, they normalized by tubulin. All this allows us to structure these figures; a toy sketch of such tags follows below. Think of this as a curatorial effort where the journal comes in and you have it early in the editorial process: you get a searchable set of tags, or entities, before the paper is published. This would not have been possible without Thomas Lemberger at EMBO, several people inside Vital-IT, the support of the Swiss-Prot group (because, as I said, without them we would have fallen into a lot of pitfalls), and two curators at EMBO in Heidelberg. That gives you a first flavour. We also need to change the way we do curation in Swiss-Prot: we need to be early enough in the editorial process, and not at the end.
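[Editor's note: a hedged sketch of what tagging a figure panel might look like. The roles follow the talk (intervention, assayed, normalizer); the schema, the transgene name and the placeholder IDs are invented for illustration and are not the actual SourceData format. P01308 is the real UniProtKB accession for human insulin and CHEBI:17234 the ChEBI ID for glucose.]

```python
# Illustrative machine-readable tags for the insulin/glucose panel
# described in the talk. "<...>" placeholders mark identifiers that would
# be filled in by curators; "Xyz1" is a hypothetical transgene name.
panel_tags = [
    {"text": "insulin",       "role": "intervention", "id": "uniprot:P01308"},
    {"text": "Xyz1 transgene","role": "intervention", "id": "ncbigene:<id>"},
    {"text": "glucose",       "role": "assayed",      "id": "CHEBI:17234"},
    {"text": "tubulin",       "role": "normalizer",   "id": "uniprot:<accession>"},
]

# With such tags attached at editorial time, "all panels where insulin was
# the experimental variable" becomes a trivial filter instead of a
# text-mining problem over the published PDF.
interventions = [t["text"] for t in panel_tags if t["role"] == "intervention"]
print(interventions)  # ['insulin', 'Xyz1 transgene']
```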
And if you follow the Twittersphere, the NIH and the Wellcome Trust are pushing for preprints, for publishing your biology experiments as soon as possible, for example on bioRxiv. This activity is going on and ramping up, and some of the funders will even take these into account when giving grants to their grantees. And we are already working in that manner, in advance, to prepare the ground for our curators in the future.

The second example is one where we worked with the group of Alexandre Reymond. In this case we used three types of activity: curation, structural biology, and network biology or systems biology. What was important is that the person with whom we were working, Jim Lupski, is one of the big names of human genetics at Baylor College of Medicine, one of the most famous in the field. The system that was studied is called Smith-Magenis syndrome, in which most of the patients have mutations in a gene called RAI1. But there is a fraction, not a lot: of the 149 people, 15 do not have a RAI1 mutation. And there was something odd with these: they looked like Smith-Magenis, but they were not bona fide Smith-Magenis syndrome. So what was done was to assess whether there were overlapping syndromes between the RAI1 Smith-Magenis syndrome and other syndromes, because depending on this overlap you might misdiagnose the disease; or whether other genes are functionally relevant for the disease and also play a role, since sometimes it is not a single gene that does all the work. Again, here the vast majority is RAI1: one detects an alteration of that gene, but there is a fraction of people who look like Smith-Magenis and do not have any RAI1 alteration. They have mutations, or rather variants, in different types of proteins: truncations, point mutations, et cetera.

You can look at these from a structural point of view, which is what Nicolas has been doing, and identify the likely underlying functional effect. This is the type of thing one should do systematically when one sees a mutation: if the structure is available, go back and look at the structure. The other thing, and that's where the curation was important, was to piece things together. Of the 15, 13 candidates (as tends to be the case in biology, not all of them fall into this) are in a regulatory network. That means they are regulated, or connected, or function together. So it's not RAI1 alone: all the genes that are perturbed in the vicinity of RAI1 are connected to one another. And I will go quite fast on this: we also looked at the expression profiles across different tissues. This is the type of analysis where you can assess cohesive sets of genes that are co-regulated within a tissue, in this case the neuronal one. That gives an indication that some of them might effectively be misdiagnoses, but there is a good overlap in how the non-RAI1 cases group with the set of RAI1 mutations. This example used both the infrastructure and the expertise to do that exercise.
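[Editor's note: the network argument, that 13 of the 15 non-RAI1 candidates sit in one regulatory network around RAI1, can be checked mechanically. A minimal sketch with networkx follows; the gene names and edges are made up for illustration, and the real network and gene list are in the published study, not here.]

```python
import networkx as nx

# Hypothetical regulatory/interaction edges; in the real analysis these
# would come from curated regulatory and interaction resources.
edges = [("RAI1", "GeneA"), ("RAI1", "GeneB"), ("GeneA", "GeneC"),
         ("GeneB", "GeneD"), ("GeneE", "GeneF")]  # GeneE/GeneF: isolated pair
g = nx.Graph(edges)

candidates = ["GeneA", "GeneC", "GeneD", "GeneE"]
hub = "RAI1"

# A candidate "falls in the RAI1 network" if some path connects it to RAI1.
in_network = [c for c in candidates if c in g and nx.has_path(g, hub, c)]
print(in_network)  # ['GeneA', 'GeneC', 'GeneD']; GeneE is not connected
```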
So now let me go down into the bare bones, or the "metal-ware" as Roberto calls it, to show you the embassies of Vital-IT. Let's go into the infrastructure. The infrastructure of Vital-IT is distributed in nature, centralized in management. There is some in Zurich, at the CHUV, at the EPFL, at the University of Lausanne (this is where Vital-IT started), at the University of Geneva, at the HUG, at the Campus Biotech, at the University of Bern. The reason we spread across all these places is that we tend to have a problem in transferring the data, and sometimes people don't want to transfer their data to a central place in Lausanne if they are coming from Bern. So fundamentally, all this forced us to structure it as an embassy system, where we have an embassy of Vital-IT in each and every place, always with the same type of infrastructure. Currently we have a lot of cores and a lot of petabytes, and people are dreadfully waiting for the cost model that will tell them how much they are going to pay for this, which I hope we will be able to send out soon.

One thing that is important, and I'd like to spend a few minutes on this, is that in order to run a distributed embassy infrastructure, we need a software release process that is systematic and robust. This is work that Christian has led for several years, with a key person, Robin Engler, who meticulously and painstakingly, like a curator, provides much of the drive for maintaining this resource, and several other people who contribute at different levels. The whole idea of this software team is that if only a single person does the work, the day that person is gone we no longer have software maintenance; having several people maintain it is critical. We follow exactly the way Fedora is released: like Fedora, we package the software and deploy the packages on each and every cluster. That means a user at the University of Lausanne, at the EPFL or at the University of Geneva can use the exact same software and the exact same script to do their analysis, wherever they are on the Vital-IT embassy system; it would be rather hard if someone moving from one place to another could not run the exact same algorithm. (A toy sketch of how one might verify this consistency is shown below.) In terms of numbers, about 2,500 software packages are maintained. This is not a small amount, and it is not something we could ask people in the community to do, because it is critical to our work. The good thing is that when one user asks for a package, it ends up being used by many, so there is an economy of scale, compared with everybody installing their own software in their own corner without knowing how to install it.

This is also quite important because we have several platforms behind the scenes, and I always show the growth: the last time I checked this year, we were at 48 terabytes of data generated every week. We were at 42 in 2015, and I guess my presentation last year prompted most people to rethink whether they want to keep data on Vital-IT or not. So we are now at 48 terabytes per week, and it comes from different types of beasts: sequencing, flow cytometry, imaging, mass spectrometry, cell-based imaging. Most, if not all, of the platforms of each institution use Vital-IT as a back-end, either to compute, to store or to archive. You can imagine that maintaining this is not something you do on the side.
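[Editor's note: the point of the packaged, Fedora-style releases is that every embassy serves identical software. A hedged sketch of how one might verify that, comparing per-site manifests of package names, versions and build checksums; the site names, versions and manifest format are invented for illustration and say nothing about Vital-IT's actual tooling.]

```python
# Compare software manifests across hypothetical Vital-IT "embassies".
# A manifest maps package name -> (version, checksum of the built artifact).
manifests = {
    "UNIL":  {"bwa": ("0.7.17", "ab12"), "samtools": ("1.9", "cd34")},
    "EPFL":  {"bwa": ("0.7.17", "ab12"), "samtools": ("1.9", "cd34")},
    "UNIGE": {"bwa": ("0.7.17", "ab12"), "samtools": ("1.8", "ee56")},
}

reference_site = "UNIL"
for site, manifest in manifests.items():
    for pkg, build in manifests[reference_site].items():
        if manifest.get(pkg) != build:
            print(f"{site}: {pkg} diverges from {reference_site}: "
                  f"{manifest.get(pkg)} != {build}")
# Flags UNIGE's samtools: the same analysis script would no longer be
# guaranteed to behave identically at that site.
```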
You have to do it professionally, also because some of the PhD students and postdocs have their careers on the line to publish their papers, and if they can't get the original data and deposit it in a repository when the paper is accepted, there are going to be a lot of people shouting. The people behind the scenes, as I showed you before for Swiss-Prot, are the senior leaders of the Vital-IT competence centre, and they range from pure hardware to pure data analysis, with mixtures in between. One thing we have been doing over the years is always to put several types of competency onto the same activities. Compared with a PhD student or a postdoc, whose mission is to do a PhD or a postdoc, our mission is to bring in the right set of people so that the PhD student, the postdoc or the group delivers the best science for themselves. We are a competence resource; we are there to help, but we can't have one person who does it all or knows it all. Not possible. That's why we use a rather industry-like resource allocation: a project comes in, the senior scientists evaluate it and estimate how much it will cost in terms of time, resources and skills, and then the project can be built up. And because people keep time sheets (this is for 2016), you can look at the network of all the people, those are the blue dots, and the projects they follow, where each line says they are working on the same project. That way we can check that no single person is doing everything, and that most of the time two to five people work on the same project, which is rather unique in the way we structure Vital-IT.

Now, during the next 10, perhaps 15 minutes, I'm going to hop through a few other stories: two, three, four, five, six, seven; I have seven, and let's see where I stop, whenever you are bored and say you want to go home. I want to show you a few examples where we push the computing part really to maximum speed. The first example is the computation of epistatic effects in Alzheimer's cohorts. For those of you who are not biologists, epistasis is the effect that two loci together have on a given phenotype; in this case we are working with Alzheimer's patients, so the phenotype is whether the person falls in the Alzheimer's category. What we are trying to see is whether any two genes, or n genes, together contribute to the fact that a person becomes an Alzheimer's patient, a mild cognitive impairment case, or other. Now, you can imagine that when you test two by two every pair of positions in the genome, that starts to be a very big computation, one that scales with the square of the number of loci (see the back-of-the-envelope below); done naively, you would pretty much heat the planet and wait forever for the answer. Several years ago, Sven Bergmann, who is now in the newly formed Department of Computational Biology at the University of Lausanne, started looking at ways to compute epistasis, and that's where we put someone like Thierry Schuepbach to the task of optimizing a process to compute something that people thought was impossible.
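[Editor's note: the quadratic blow-up is easy to make concrete. The talk quotes, just below, about 3,125 billion pairs computed in six to seven days; the SNP count in this back-of-the-envelope is inferred from that pair count, so treat it as an assumption.]

```python
import math

pairs = 3.125e12                 # ~3,125 billion locus pairs (from the talk)
# n(n-1)/2 = pairs  =>  n ~ sqrt(2 * pairs) for large n
n_loci = math.sqrt(2 * pairs)
print(f"~{n_loci:,.0f} loci")    # ~2,500,000 SNPs (inferred, an assumption)

days = 6.5                       # "roughly six to seven days"
throughput = pairs / (days * 86400)
print(f"~{throughput:,.0f} pair tests per second sustained")

# Doubling the number of loci quadruples the pair count, which is why a
# naive implementation "heats the planet" and an optimized one matters.
```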
Turns out it is possible, and it is possible to do it fully deterministically: that means without cutting corners, shoving data left and right, or pruning with p-values. This was a fully deterministic calculation, and we could do it on three cohorts: TGen, the Harvard one, and ADNI, which is the largest and has full whole-human-genome sequences. You can see the number of pairs we had to compute, about 3,125 billion pairs, and you can't wait six years for that; this was done in roughly six to seven days. Through these means we were able to compute the epistatic interactions in these three cohorts in very little time, and we could bring forward the different types of genes; this is a summary of all the replicated regions we found across the three cohorts. I will not talk too much about this, because Jérôme Dauvillier, who is here, will give a talk at the Lausanne computational biology meeting in January on that topic and will go into the gory details and all the things that do not work. But I wanted to highlight it because it is one of the outputs of the work we do with a research group that needs expertise, producing a system that allows us to compute this kind of interaction. It is part of an EU grant that Michel Simonneau at INSERM in France has been leading over the years, which will finish next year.

The other project we have been dealing with is called PoSeNoGap; sometimes you have to come up with funky names. PoSeNoGap is in the field of compressive genomics, a so-called ultra-fast genome analyzer, and it is joint work between us, the EPFL and the HEIG-VD in Yverdon, the Haute École that is one of the universities of applied sciences in Switzerland. The project was funded by the supercomputing centre in Ticino, where the aim was to find applications beyond the traditional meteorology or financial models, things that are very well known and very heavily used, and to try to transform some of these. The genomes, and the capacity to transfer and analyze genomic data, are becoming one of our big challenges, and I'll show you that in a moment. Our colleague Marco Mattavelli at the EPFL and Ian Thomas have been working on this over the last two and a half years, and will probably come up with a few presentations next year on it.

One thing that is rather important is that we see more and more people thinking along the lines of sequencing a person over their whole life, from birth to death. How much this will cost from an economic point of view, I am not the one to say; but fundamentally, if you can do this, you can identify several types of undiagnosed disease and sometimes prevent a disease. You have the example of Mike Snyder, who changed his habits and pretty much avoided type 2 diabetes; this is the kind of N-equals-1 experiment. And also, perhaps, and that's what Craig Venter wants to sell with Human Longevity in San Diego, you can extend the health span. Now, one thing that is clear is that a lot of things change in the way we process the data, the way we assess the data, and what we know about the genomes. But there is one thing that doesn't change, and that is the original information: it's still A, C, G and T. It can come in short or long reads, in pairs or alone, but this does not change.
So the original data will be what remains; the way we interpret it changes with our knowledge. Now, the ADNI sample was very useful because we had 800 whole genomes covered 50 times. When this Alzheimer's project started, they said they were going to push the data to us over the wire, so we looked at transferring 40 terabytes of data by FTP and how long that would take. In the end we sent machines (not that size of machine, about a quarter of it), shipped them to the US, got the data onto them and shipped them back, because it would have taken far too long to move the data from California to Switzerland over the network. One of these genomes is about 300 gigabytes of data, because there are two sets of reads in there, about 1.3 billion reads. We have 800 of them, so about 240 terabytes, and that is a small project of only 800 people. Now we are talking about thousands and thousands of people, so you can multiply that, and you start to be very stressed.

One thing that is also true is that when you want to analyze just one of these files and you run one of the so-called state-of-the-art variant callers (this was GATK 3.2, and I think they have got to 4-point-something and keep updating it), it is not done in a reasonable time: it was taking 7 days, and on some files it was even crashing. So what we tried to do was find a new and faster way to analyze this data. We reduced that, and I have to be very careful here or I'll be hit on the head by Nicolas and Christian, to close to or below one hour for the very same type of activity that these tools perform. If we are close to the hour, and we can identify variants in a much more deterministic way, we are in good shape, because most of these methods are not deterministic: they use some kind of statistics, some kind of inference. This is what you get if you take the data in one of these files and map it: each of these lines is a read, and this is the variant that is identified; GATK will call this variant. We also evaluated commercial solutions, which I can't really name because we signed NDAs with them. This company, on this data, was able to see three variants; with the very same data set, the approach that I will not describe here, due to time, robustly and reliably identifies all four variants. And this one is effectively there; it just doesn't have enough coverage to be called. (A toy illustration of the deterministic-counting idea follows below.)

Now, when we talk about variant calling, people say the problem is already solved, that it no longer exists. We have several counter-examples: one that Brian is leading in immunotherapy, and another, the project called the Napoléon, which we hope to publish at some stage, with Alex Reymond working on submitting the paper, where this method we developed finds many more validated variants than the state-of-the-art methods people use for variant analysis. We have also been selected by, and are in discussion with, PrecisionFDA, the Food and Drug Administration platform, to see whether this kind of method could be used as a benchmark for all the commercially available methods. It will be quite interesting to see whether this method becomes the gold standard for evaluating commercial solutions. Remember, the SIB is a not-for-profit organization: we don't sell things, but we provide expertise to the scientific community.
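[Editor's note: the contrast drawn here is between statistical callers and exhaustive, deterministic counting. The toy below is emphatically not the Vital-IT method, which is not described in the talk; it only illustrates what "deterministic" means: count bases per position and apply fixed thresholds, so the same input always yields the same calls.]

```python
from collections import Counter

def call_variants(pileups, ref, min_depth=10, min_alt_frac=0.25):
    """Toy deterministic caller. pileups[i] is the list of read bases
    covering reference position i. No sampling, no probabilistic model:
    identical input always produces identical output."""
    calls = []
    for pos, bases in enumerate(pileups):
        depth = len(bases)
        if depth < min_depth:
            continue  # the "not enough coverage to be called" case
        alts = Counter(b for b in bases if b != ref[pos])
        if not alts:
            continue
        alt, count = alts.most_common(1)[0]
        if count / depth >= min_alt_frac:
            calls.append((pos, ref[pos], alt, count, depth))
    return calls

# Example: position 1 has 12 reads, 5 of them 'T' over a 'C' reference;
# position 3 has too few reads to be called at all.
ref = "ACGT"
pileups = [["A"] * 12, ["C"] * 7 + ["T"] * 5, ["G"] * 12, ["T"] * 3]
print(call_variants(pileups, ref))  # [(1, 'C', 'T', 5, 12)]
```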
The other thing, which is rather important, and this is what Marco Mattavelli is doing with different groups from the EMBL, Sanger, MIT and Stanford, is that, in the light of sequencing a person more than once over their entire life, we cannot keep 300 gigabytes of data every time we do it; it's not economically viable. The people who work on MPEG, the Moving Picture Experts Group, the people who handle and compress video, like Rogue One from Walt Disney (I should not say it will be on YouTube), know what MPEG is. What we are trying to do within this consortium is to establish an MPEG for genomes, so that basically you can integrate this data over time (a toy illustration of the underlying idea follows below). This is something we hope will be ready in 2019, and for which we received CTI funding from the Confederation.
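[Editor's note: why do genomes compress so well? Aligned reads mostly match the reference, so one only needs to store positions and differences. The sketch below is a toy of reference-based delta encoding under that assumption (ungapped alignment, no quality scores); it has nothing to do with the actual bitstream the consortium was designing.]

```python
def delta_encode(read, ref, pos):
    """Store an aligned read as (position, length, mismatches) instead of
    its full sequence. Toy model: ungapped alignment, no quality scores."""
    mismatches = [(i, b) for i, b in enumerate(read) if b != ref[pos + i]]
    return (pos, len(read), mismatches)

def delta_decode(encoded, ref):
    pos, length, mismatches = encoded
    seq = list(ref[pos:pos + length])
    for i, b in mismatches:
        seq[i] = b
    return "".join(seq)

ref = "ACGTACGTACGT"
read = "ACGAACG"           # one mismatch against the reference
enc = delta_encode(read, ref, 0)
print(enc)                  # (0, 7, [(3, 'A')]): far smaller than the read
assert delta_decode(enc, ref) == read  # lossless round trip
```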
The last example is a project that started here at the University of Lausanne with Bernard Thorens; it's an IMI project on diabetes covering knowledge, data management and systems biology. The first project, which started already seven years ago, is called IMIDIA; it joined academic and industrial partners, ran for five years, and focused on understanding beta-cell function. If you want to look at this, there is a talk from Mark in the virtual seminar series online. For all these projects and several others, we have been providing a platform to manage the data and the knowledge: to annotate, curate, share and visualize this information. We never actively promoted it; it is used for each and every project. But we realize now that a lot of people are dealing with the exact same issue, which is maintaining their data sets, maintaining the analyses, and being sure they are robustly structured. That's why we call it the VCAM. I must say the name was more of a joke than a really thoughtful naming process: VCAM, myCAM, ICAM, etc. If someone comes up with a better name, they will get a bottle of wine.

But the whole point of the VCAM is that it is distributed in nature, because people were coming from different places, with different types of data, different stakeholders and different organisms, and all this needed to be brought together and managed properly. Each of the steps, when we ran this project, had to be curated at some stage, structured and brought together, and this is where the Swiss-Prot expertise is again very important, because without that awareness we would have made a lot of errors in the human data-curation steps needed along the process. This is something other projects benefited from without knowing they benefited from it. Through this, we now have a way to look at the data structure and search it, to link it semantically to different types of normalization and standards, much in the same way as I described for SourceData, and to visualize it; we can become extremely creative with data visualization, and I could probably name five or six other projects that deal with that.

The goal was really to transform this data, and we got lucky, because in Swiss-Prot we have someone who has been doing RDF for the past 10 years, actually more, almost 14 years now; and by bringing different people together, Dmitry and Jerven, we have been able to map all the data sets we generated within this project into a semantically interoperable knowledge base. This is one of the grand challenges for all projects that deal with human samples: most of the time they do not have the means, or have not put in enough resources, to structure their knowledge semantically and ontologically. They always underestimate this and say they will do it later; if it is done early in the project, you are in good shape. The other benefit of such an approach is that IMIDIA is now RDF-compliant. I have to say, people nowadays call this FAIR: findable, accessible, interoperable, reusable. It's the buzzword people use more and more, and it's good, because perhaps people are realizing that it is actually important to make data reusable. IMIDIA being a FAIR data set pretty much allows us to bring in other resources that are also semantically interoperable. This is quite important, because we will never have a Vital-IT of 500 people, I hope not; there will be other groups producing new types of data that we will need to bring in, and that is the real challenge: we can't just invent yet another standard, yet another format. By transforming what we generated in the project into a reusable framework, we can enhance it and integrate other resources. The people who have been working on this span both Swiss-Prot and Vital-IT; this is led by Mark, and we have been doing it for the last few years.
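[Editor's note: making a dataset "semantically interoperable" concretely means expressing it as subject-predicate-object triples against shared vocabularies. A minimal rdflib sketch with invented URIs follows; the real IMIDIA model and ontologies are not shown in the talk, and real projects would reuse standard ontologies rather than an example.org namespace.]

```python
from rdflib import Graph, Literal, Namespace, RDF

# Invented namespace for illustration only.
EX = Namespace("http://example.org/imidia/")
g = Graph()

# A hypothetical islet sample, typed and linked to a condition and a value.
sample = EX["sample_0042"]
g.add((sample, RDF.type, EX.IsletSample))
g.add((sample, EX.fromDonorWithCondition, EX.Type2Diabetes))
g.add((sample, EX.hba1cPercent, Literal(7.2)))

# Once everything is triples, cross-dataset questions become SPARQL queries:
q = """SELECT ?s WHERE {
         ?s a <http://example.org/imidia/IsletSample> ;
            <http://example.org/imidia/fromDonorWithCondition>
            <http://example.org/imidia/Type2Diabetes> . }"""
for row in g.query(q):
    print(row.s)  # http://example.org/imidia/sample_0042
```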
The good thing is we got a follow-up, actually two follow-ups. One is called RHAPSODY, a risk assessment of the progression of diabetes; it is really about pre-diabetes progressing to type 2 diabetes. It's a large consortium, as the EU likes, but unlike the H2020 projects where people go, take the money and do their own stuff, in this one you work together to move forward something that pharma will use later on, as an early-access programme or for decision-making on treating, or on developing new drugs in diabetes. This is quite important because it builds on three IMI projects that ran over the last five years, where Switzerland, in this case the SIB, is the data coordination centre for IMIDIA; in DIRECT, the group of Manolis Dermitzakis is one of the main contributors; and in SUMMIT I don't think we have anybody from Switzerland, if I'm not mistaken. But at least two of the three platforms are contributions, directly or indirectly, from the SIB or from SIB groups. In these projects we sit at the interface of federated databases: we don't believe in a centralized database, we believe in a federated way of working; people need to work together, and just as the Confederation works, the federation has to work, even if it's more difficult. We contribute to all three groups, so we are rather critical for disease stratification and diagnosis, and you will hear more about this in the coming years, and early next year with the second IMI project that is coming. I don't think we would have been able to do this if we had treated these projects as just more projects that fund us. We took them seriously, to structure our infrastructure, and our expertise and know-how are what allowed us to get this kind of project, because we professionally maintain a competence centre.

So, to wrap up, and you see how 154 slides go by in an hour, I hid a lot of them: the excellence and competence centres. I hope I could convey to you the difference, and the importance of having these together. It is really important, because research groups benefit from this, and sometimes they benefit even indirectly from these projects. If you have n groups that compete with each other, it is extremely rare, unless they are joined in one project, that they will benefit directly from one another; a competence centre, on the other hand, has a direct benefit to people, because the people are there, the infrastructure is there, and the people and their knowledge can be applied to another project. The projects I did not talk about today, and I could have gone on, include a project on trisomy detection, which is actually a product on the market whose algorithmic side we developed (I talked about it last year); a project where we characterized cell lines for biotherapeutics for Selexis, which is a real competitive advantage for that company; and projects enabling flow cytometry. I hope at some stage we will get several other people presenting these at the virtual seminars, from a more user-oriented point of view.

Now, to finish up: in conclusion, competence centres are critical in the 21st century, both for development and for economic impact. It is big, that's true; it's a lot of people, that's true; but it is also a way to transform some ideas into a product, some ideas into support. And if the competence centres look outward and not inward, people can benefit from them directly. The type of work performed by the competence centres enables their user community to accelerate their research, avoid pitfalls, manage their data adequately, and get the necessary training, because we must not forget that many of the people we help in the universities are PhD students and postdocs. Sustaining these competence centres and their impact is a critical aspect, for which we have been well supported by the Swiss Confederation through the SIB, and I think in the future, with activities like the Swiss Personalized Health Network, this will become even more important.
Internationally, too, other organizations, at the European level like ELIXIR and at the US level like Big Data to Knowledge (BD2K), also realize that they can't just throw money at people generating data: they need competence centres that take care of these data, structure them, make them reusable, and train people to use them adequately. This is something which, honestly, 5, 6 or even 8 years ago, people did not realize. It will not change the way we work now, but it will in 10 or 15 years; that's why I put "and beyond", and that's why I want to have this talk recorded. I'd be interested to see the next director of Vital-IT, whoever it is in 10 years, looking at this talk and saying whether what I said was right or wrong. With that, I thank you for your attention, and I am ready to take any questions or comments you have. Thank you.