So, hello everybody, my name is Andrei Zinovyev. First of all, I would like to thank you, Dima, for this kind invitation. It's an honor for me to conclude this exciting workshop, during which I learned many, many things which I hope will be useful at some point in our research. I work at the Institut Curie, which is a cancer research center distributed over several sites, including Orsay but also the center of Paris. I'm going to talk about biological networks, a type of network which I believe was not discussed during these four days, so I will give you some ideas of how biological networks, which are one of the main tools in our research, can be approached and understood through the Google matrix idea and some related ideas. Biological networks are basically an attempt to represent, in a graphical form, in the form of graphs, the complex biochemistry happening inside a living cell. This is an example of a biological network which represents a global reconstruction of all metabolic reactions happening inside a human cell. Currently this reconstruction contains more than 10,000 reactions. Formally speaking, this reconstruction is a bipartite graph, which is zoomed out here: the small squares represent chemical reactions, and the arrows connect them to the chemical substances which enter into the reactions and come out of the reactions. This is the classical representation of chemical reaction networks. Metabolism is only a small part of the story: metabolic networks describe what human cells eat and how they reproduce, because we can talk about the metabolism of DNA. But there are two important layers connected to these metabolic networks. One important layer, of equal and even bigger size, can be called signaling networks.
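As a toy illustration of this bipartite reaction-substance representation, one could encode it as follows. This is only a minimal sketch; the class, reaction and species names are invented for illustration and are not taken from the actual reconstruction.

```python
# Minimal sketch of a metabolic reconstruction as a bipartite graph:
# two node types (reactions and chemical substances), with directed
# edges substrate -> reaction -> product.
from collections import defaultdict

class ReactionNetwork:
    def __init__(self):
        self.edges = defaultdict(set)      # node -> set of successor nodes

    def add_reaction(self, name, substrates, products):
        for s in substrates:
            self.edges[s].add(name)        # substance feeds the reaction
        for p in products:
            self.edges[name].add(p)        # reaction yields the substance

net = ReactionNetwork()
# Invented toy reaction: glucose + ATP -> G6P + ADP, catalyzed by hexokinase
net.add_reaction("hexokinase", ["glucose", "ATP"], ["G6P", "ADP"])
```

In the real reconstruction the same structure simply holds tens of thousands of such reaction and substance nodes.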
These networks, which can also be represented as chemical reactions, describe how a human cell interacts with its environment: which signals it receives from outside the cell or from other cells, and how these are translated into changes in metabolism. There is also an underlying layer of transcriptional networks. These are the networks which describe how a human cell reads the information encoded in its DNA, how it reads from this template in a cell-type-specific manner, and, again, how the instructions encoded in DNA are translated into the other layers of signals. So this is a very rough picture of what we call biological networks. The dream of the science called systems biology is to create a global mathematical model, a dynamical model, of these networks, which for the moment is a more or less hopeless exercise, because these networks are huge. In principle we have a mathematical formalism, called chemical kinetics, designed to put these networks into equations, but the problem is that for the majority of the parameters we have only a very vague understanding of the order of magnitude, and we know that for many parameters of these networks we will never have quantitative estimates, so this is more or less hopeless. There are many approaches which aim to simplify the chemical kinetics formalism, like flux balance analysis, which looks only at the stoichiometric matrix, or ideas of modularization and abstraction.
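The chemical kinetics formalism mentioned here can be sketched in a few lines: the state evolves as dx/dt = S v(x), with a stoichiometric matrix S and mass-action rate laws v. The species, reactions and rate constants below are invented purely for illustration.

```python
# Hedged sketch of chemical kinetics with mass-action rates.
# Two invented reactions over three species A, B, C:
#   A -> B        (rate constant k1)
#   B + B -> C    (rate constant k2)
import numpy as np

S = np.array([[-1,  0],    # row for species A
              [ 1, -2],    # row for species B
              [ 0,  1]])   # row for species C
k = np.array([1.0, 0.5])   # assumed rate constants

def rates(x):
    A, B, C = x
    return np.array([k[0] * A, k[1] * B * B])  # mass-action rate laws

def step(x, dt=0.01):
    return x + dt * (S @ rates(x))             # explicit Euler step

x = np.array([1.0, 0.0, 0.0])                  # start with pure A
for _ in range(1000):
    x = step(x)
```

Note that the linear invariant A + B + 2C is conserved exactly by this scheme, which is a useful sanity check on the stoichiometry; the difficulty in practice is not writing such equations but knowing the thousands of rate constants.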
When I was a postdoc at IHES for three years, I had time to think about these kinds of approaches: for example, introducing macro variables and describing chemical kinetics in analogy to thermodynamics, or using assumptions about the distribution of kinetic parameters and their qualitative relations, which parameter is bigger than which in each fork of reactions, to suggest asymptotic approximations of chemical kinetics. This became part of the idea of the asymptotology of chemical reaction networks, which we worked on together with my former scientific advisor. But again, all this is part of a very long program. To be pragmatic, people simplify enormously the structure of these biochemical reaction networks into simple graphs. Here are a couple of examples of such simple graphs. One very simplistic representation of biochemical networks is just an undirected graph. Here it's a schizophrenia protein-protein interaction network. Each node in this network is a protein, a basic building block of a cell. A link records the fact of physical interaction between two proteins. If two proteins interact, it means that this interaction was selected by evolution, so it can probably serve a certain purpose inside the cell. Large networks like these are called protein-protein interaction networks, and their basic mathematical model is a simple undirected graph. But in the type of biology we are dealing with, people go a small step further: they introduce directed graphs where nodes are again proteins, but the direction gives you an idea of causality, such that one protein can cause a change in the function of another protein. These are another type of network, this time represented by directed graphs. Some of the networks that I showed here are measurable in principle.
It's quite expensive, but some of them you can measure in a given sample, so you can systematically apply some experiments and derive the structure of the network. People do it and it's very interesting, but most of these networks are constructed not by direct experiment: they are constructed by reading the literature. Because the information about these networks is distributed across thousands of publications, people read them and extract this knowledge in some graphical form. We have been doing this in our group for more than 10 years. We construct a resource called the Atlas of Cancer Signaling Network. This was an idea I had when the group was created: to construct a kind of world map of the biochemical networks implicated in cancer. This map is called ACSN. This is how the network is organized: again, it's a large bipartite graph containing more than 10,000 reactions today. The idea was to embed this network with some pre-designed layout. Of course, this graph is not planar, so you cannot represent it on a plane without intersections of edges, but you can make it approximately planar and organize it in such a way that it reflects our knowledge about the biological functions implicated in cancer. Then you can embed it into the Google Maps API with a function of semantic zoom, such that when you zoom in you see more details, and when you zoom out you see a more abstract representation of the network. What can be the use of this, besides just browsing and forming queries to this network? You can also take some quantitative information that we have about biological systems. For example, you can take a cancerous cell and a normal cell and see what the changes are in the concentrations of proteins between the two types of cells.
Then you will have a number for each protein which tells you how much this protein changes, in terms of presence or concentration, between the two conditions. Then you can project this information onto the map. Usually, in this naive form, the visualization is not very insightful, but if you use the structure of the network, the connections between proteins, to smooth your data, then suddenly you might see some patterns in the data which can be very meaningful. Here, for example, red color means positive change, green color means negative change, and you immediately see that some functions represented on this plot are over-activated or down-regulated. So this is a small example of how biological networks can be used together with the data that we systematically collect nowadays in cancer research, for example. This leads me to this introductory slide, where I try to answer the question: what is the use of biological networks in biological data analysis? I think the central message of this slide is that biological networks allow you to avoid a very fruitless discussion about what a biological function is. The notion of biological function is central in modern biology. The idea is that all genes in your genome should be labeled with small sets of functions which they perform in the living cell. The dream of biologists is to put several labels on each gene and say that this gene is implicated in this function, this function and this function. Unfortunately, this is a very naive idea because, first of all, many genes are involved in many functions, but there is also a problem with the idea of biological function itself: the definitions of biological functions are very vague. For many proteins, we don't know what their biological functions are; for approximately half of the genes we still have no idea in which functions they are involved.
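The smoothing step described above can be sketched as a simple neighbor-averaging scheme that blends each protein's own value with the average over its interaction partners. This is an assumed minimal variant for illustration, not the actual method used on the ACSN maps, and the graph and fold-change values are invented.

```python
# Toy sketch of network smoothing of per-protein fold changes.
import numpy as np

def smooth(values, adjacency, alpha=0.5, iterations=10):
    """values: protein -> fold change; adjacency: protein -> neighbours."""
    v = dict(values)
    for _ in range(iterations):
        new = {}
        for p, nbrs in adjacency.items():
            neigh = np.mean([v[q] for q in nbrs]) if nbrs else 0.0
            # keep a fraction of the original signal, pull toward neighbours
            new[p] = (1 - alpha) * values[p] + alpha * neigh
        v = new
    return v

# Invented 3-protein chain A - B - C with a noisy measurement on B
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
raw = {"A": 2.0, "B": 0.0, "C": 2.0}
smoothed = smooth(raw, adjacency)
```

After smoothing, B is pulled up toward its up-regulated neighbors, which is exactly the kind of pattern that becomes visible on the map only after using the network structure.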
So when we have this large graph connecting proteins, in a way we can get rid of the idea of biological function: we can say that the function of a protein is its position within the network. Instead of saying what the function is, you can just say that this protein is located in this area of the network, and this is its function. This makes it very useful for downstream data analysis: you can combine the information about functional proximity, which is defined by the biological network, with the data that you collect on the biological system. The most pragmatic implementation of this idea is the so-called guilt-by-association principle in the analysis of biological data. And not only in biology is the guilt-by-association principle used: on Facebook, even if your page is completely empty, with no content, if you have a list of friends, a lot of things can be inferred from your friendship neighborhood. It's the same in biology: if you don't know anything about a protein, but you know which proteins it interacts with, you can tell a lot about its function. There is a lot of discussion about whether we can go a step beyond this idea of guilt by association through the direct neighborhood: can we make two steps from a protein, three steps, and so on? This leads to the idea of network propagation, or propagation of influence, over distances longer than the direct neighborhood. This has become the main working idea in the field, and there are even reviews.
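The network propagation scheme alluded to here can be sketched as a random walk with restart on a toy graph: iterate p = (1 - alpha) p0 + alpha W p to its stationary state, where p0 marks the perturbed seed nodes. This is a hedged minimal implementation; the graph and seed are invented, and the exact normalization and alpha differ between the published methods.

```python
# Random walk with restart (network propagation) on a toy graph.
import numpy as np

def propagate(A, seeds, alpha=0.5, tol=1e-9):
    """A: adjacency matrix; seeds: indices of initially perturbed nodes."""
    W = A / A.sum(axis=0, keepdims=True)   # column-normalise by degree
    p0 = np.zeros(A.shape[0])
    p0[seeds] = 1.0 / len(seeds)           # restart (preference) vector
    p = p0.copy()
    while True:
        p_next = (1 - alpha) * p0 + alpha * (W @ p)
        if np.abs(p_next - p).max() < tol:
            return p_next
        p = p_next

# Invented path graph 0 - 1 - 2 - 3 with a perturbation at node 0
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
p = propagate(A, seeds=[0])
```

The resulting scores decay smoothly with network distance from the seed, which is the "field of influence" picture used throughout the rest of the talk.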
This is a recent review from Nature about network propagation as a universal amplifier of genetic associations, and the first figure of this review contains this picture, which is quite familiar to you: if you know the structure of the network and you have an initial perturbation, then by the idea of a random walk you can derive a stationary state in this network, a smooth function which gives you an idea of the influence of the initial perturbation on the rest of the network. About 10 years ago we worked on this network propagation idea ourselves; as a matter of fact, our group was one of the first to apply network propagation for data analysis in cancer research, in biological research. We discussed this idea of using smooth functions defined on the graph and suggested a very simple approach analogous to a Fourier transformation on graphs, using the properties of the Laplacian of the graph of the biological network: if you have some function projected onto the graph of the biological network, you can distinguish a smooth component and a high-frequency component on this graph, and say that you are more interested in the smooth component.
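The graph-Fourier idea described here can be sketched by decomposing a signal into eigenvectors of the graph Laplacian and keeping only the smoothest (lowest-frequency) modes. The graph and signal below are invented toy examples, not data from the cited work.

```python
# Low-pass filtering of a signal defined on a graph via the Laplacian.
import numpy as np

def low_pass(A, signal, n_modes=2):
    """Keep only the n_modes smoothest Laplacian eigenmodes of the signal."""
    L = np.diag(A.sum(axis=1)) - A           # combinatorial Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues sorted ascending
    U = eigvecs[:, :n_modes]                 # smoothest eigenvectors
    return U @ (U.T @ signal)                # project and reconstruct

# Invented cycle graph on 4 nodes with an oscillating, noisy signal
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)
f = np.array([1.0, -1.0, 1.2, -0.8])
f_smooth = low_pass(A, f, n_modes=1)
```

With a single mode on a connected graph, the smooth component is just the mean value spread over all nodes; adding more modes recovers progressively finer, but still smooth, structure.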
We applied this idea to some classical machine learning tasks. For example, in this paper we tried to understand the effect of radiation on cells. These are experiments which were done at the campus of our university here: systematic measurements were made on non-irradiated and irradiated cells, and we applied a classical support vector machine to these data. By definition, an SVM is not obliged to behave smoothly on the graph of a biological network, but if you impose smoothness, then the weights of the support vector machine will be distributed much more smoothly on the graph, which makes the results of your analysis much more insightful: you can immediately indicate which parts of the biological network are affected positively by radiation and which are affected negatively. In a similar project we used network smoothing to improve the construction of layouts of biological networks. Layout of these networks, since they are quite complex, is a challenging problem, because the state-of-the-art force-directed layouts frequently are not very insightful and not very useful for looking at the biological data. So we used the idea of first smoothing the data itself using the structure of the biological network, and then using dimension reduction to lay the biological network out on a plane. This helps to create more insightful two-dimensional layouts of biological networks. This is a slide which I usually use to explain to biologists, who are frequently uncomfortable with mathematical formulas, the difference between the two main approaches to network propagation: the approach connected to simulating diffusion on a graph, and the random walk with restart. Usually I use the analogy of a drop of ink which you put on one part of the graph; in diffusion, in the stationary state, the ink is uniformly distributed over the graph, so you have to stop somewhere in between to obtain some smooth distributions
on the graph. This is connected to the use of the properties of the Laplacian. Then there is the idea of random walk with restart, whose most famous implementation is PageRank and the Google matrix that we all discussed here. Again, this is a simple explanation of how the PageRank algorithm works; I'm not going to dwell on it. What I would like to show is how this network propagation can serve a very concrete purpose: predicting the effect of mutations on the survival of a patient diagnosed with cancer. The type of data that we started to obtain relatively recently is genomic information about tumors extracted from the bodies of patients. Nowadays, when a piece of tumor is extracted from the body, it can be characterized in terms of its genome, and one important characterization is how many mutations, single-nucleotide changes, each gene of a given tumor accumulated during the evolution of the tumor. The accumulation of mutations is a physiological process, in a way: starting from birth, from the fertilized egg, each division can in principle bring several mutations into the genome. When a cancer appears somewhere in the body, it usually leads to a much faster accumulation of mutations, and when some modern therapies are applied, such as chemotherapy, they can trigger the accumulation of mutations even more. As a result, today we have these tables where, for a large cohort of several hundred patients, we can characterize the tumor of each patient in terms of the mutations in each gene. The most frequent representation of this information is just a binary matrix where, at each intersection, you see which gene is mutated in which tumor. We believe that part of these mutations are causal for the way the tumor will respond to therapy, so in principle this can be used for prognosis of how successful the treatment of the cancer will be. But there is a problem with this type of data, and this problem is
described on this slide. If you take simply this matrix, and you take a y vector which gives you the number of months a patient survived after treatment of the cancer, you can try to predict this survival in terms of the concordance index. Roughly speaking, the concordance index tells you, for each pair of patients, which patient will survive longer after the initial diagnosis of cancer. Of course, if you just make a random guess, the average concordance index will be 0.5. The point is that if you take the mutation matrix and apply some state-of-the-art machine learning method to predict this concordance, you can see from this slide that the prediction is very poor: it is basically statistically indistinguishable from random prediction. Somehow this direct application of the data to predicting cancer survival doesn't work. Also, if you try to cluster this data, for example by applying some state-of-the-art matrix factorization techniques, you will fail, because the clusters that you obtain will not be very useful. When you apply clustering, typically your desire is to obtain clusters which are more or less balanced, such that they represent some meaningful groups in the data. For this type of data, the observed behavior is like this: you obtain some kind of gigantic cluster and a very few clusters which contain one or two patients, so it's not useful for any particular purpose. [Question: what is clustered here?] Your initial matrix is patients by genes, and then of course you can define, for example, a Euclidean distance between two patients based on these gene vectors, and cluster. [Question: how many genes are there typically?] As we discussed yesterday, in our genome we have, it's again a matter of discussion, but from 20,000 to 30,000 genes, so this is the dimensionality of the vector, and a few
hundred samples, usually. But this is very relevant information, because this slide explains why this clustering, or any attempt to predict something from this data, doesn't work. There are two factors behind this failure. The first factor is a very unequal distribution of the total number of mutations: whatever cancer type you take and profile for mutations, you will find that some tumors carry more than a thousand mutations while some tumors carry very few. This heterogeneity in the total mutational burden is a typical characteristic of any collection of tumors that you take, at least for adult cancers; for pediatric cancers it might not be the same, but for adult cancers this is a more or less universal picture. So you have a large confounding factor which affects any application of machine learning techniques, which simply means that you need to normalize or preprocess this data somehow to prepare it for further analysis. But an even more drastic feature of this data is that the overlap between mutations is very small. This is a patients-by-patients matrix of overlap, showing how many mutations two patients have in common, and you see that typically two patients have very few, less than 10, mutations in common. Mathematically speaking, this means that, as a multi-dimensional distribution, the data that I showed in this matrix forms a very high-dimensional hypercube, and you know that applying machine learning techniques in a meaningful way to such a distribution of data cannot be very successful. So people think a lot about how to reduce the dimension of this hypercube, how to create useful low-dimensional projections of this data, and one of the most productive and, in a way, ingenious ideas was to use biological networks. The idea is the following: imagine that you have
a biological network which, as I said before, describes biological function in a way. You can take two patients; in this case, let's take patient one and patient two. The mutations of patient one are shown in yellow, the mutations of patient two in blue, and the co-occurrences between genotype one and genotype two in green. In this model example, you see that on top of this biological network you have only one co-occurrence, and this symbolically represents more or less what happens in reality: you have very few common mutations between two patients, two tumors. Then the idea is to apply network propagation: you say that a mutation affects not only one single node of the network, but that its influence spreads over some neighborhood, affecting the function of that neighborhood. For example, you can apply, as I will show on the next slide, some random-walk-based approach to describe this field of influence. Then, instead of a binary matrix, you obtain a continuous non-negative matrix describing how much the mutations affect the neighborhoods in the graph of the biological network, and suddenly you will see that the overlap between two genotypes might be very significant in certain parts of the network, and you can say that this part of the network is affected by mutations. This changes a lot our ability to cluster the data and also to make predictions from it. So this is the trick for utilizing this data by means of biological networks. In terms of mathematics, the approach is very simple. You encode your mutations into what was called the preference vector in the previous talk: you say that you have a network in which nodes A and D are mutated. In this particular work they applied something similar to the Google matrix, not exactly the Google matrix approach but a related one in which the matrix is normalized differently: A is the adjacency matrix, D is
the normalization (degree) factor, and you have the same alpha, taken as 0.5 in this particular paper. As a result of applying this random walk strategy, you obtain a smooth distribution of influences: for example, a node which is relatively distant from the mutated nodes is not affected, but a node which is close to two mutated nodes receives the highest score. So this is the rationale. Together with two of my colleagues, Jean-Philippe Vert, who has now become part of the Google Brain office in Paris, and Marine Le Morvan, we reimplemented this approach and tried it on different cancer types: lung cancer, breast cancer, glioma, kidney cancer and ovarian cancer, for which we have a significant amount of data. We reapplied this approach using state-of-the-art prediction, and you can see that, with respect to random prediction, for many cancer types it did not show any change compared with the previous work, but for two cancer types you can clearly see some benefit from this trick. We also learned that you need to apply some normalization, in that case quantile normalization. And what we discovered is that, in terms of network propagation, using the first neighborhood is as a matter of fact already sufficient to achieve this level of prediction; in this particular application it is not that useful to use network distances longer than one. This led us to a simplified implementation of the guilt-by-association principle using only the first-neighborhood information in the network. In this approach that we suggested, published last year as NetNorM, the idea is that you introduce a parameter k, which is a kind of reference number of mutations that you want to have in the network. Then, in this example k is 4, this is a small biological network, and in this case the tumor has 3 mutations, if you
have fewer than k mutations, you add new, fictitious mutations, proxy mutations if you want, and you put them at the positions which are most connected to the already mutated nodes. In this case, the node most connected to mutated nodes is here, so you put a proxy mutation here, and you add as many mutations as you need to reach k. If you have more than k mutations, you do the opposite: you remove mutations from your network, starting with the peripheral nodes. So you end up with k mutations, assigned to the most connected nodes. The rationale of this approach is very simple: if many neighbors of a given node are mutated, then probably its function is also compromised, and if a mutation is associated with an orphan node, it probably does not affect an essential biological function very much in a given tumor. The network used here is a mixture close to what I depicted before: it combines 3 levels, with relations of metabolic type, signaling type and transcriptional regulation type, so it's a mixture of many different types of regulation, and it is quite dense. As a result, we showed that this simple approach already works better than the network propagation reference method. A funny question that we asked ourselves is: what if, instead of biological networks, you used random networks, completely rewired networks? Intuitively you should expect a deterioration of the performance of the algorithm, which is not totally the case. The funny observation is that even if you use random networks, some kind of random projection of the multi-dimensional data into a random low-dimensional space, you still improve your prediction. Nevertheless, biological networks perform better than random networks. This is a kind of positive
conclusion from this study; and the method that we suggested, which uses only first-neighborhood relations, benefits from the structure of the real biological networks much more than the state-of-the-art method. Okay, so now I come to our collaboration with Dima Shepelyansky and Klaus Frahm on applications of the Google matrix and the reduced Google matrix approach to biological networks. In several studies we exploited the idea of the reduced Google matrix. First of all, we applied it to a reconstruction of a signaling network called SIGNOR. It's a relatively small network which contains 3,000 nodes and 7,000 edges; this is the typical hairball which you obtain after applying some force-directed layout algorithm. This is the distribution of proteins in the network on the PageRank-CheiRank plane; pay no attention to the colors here, they don't mean much. You see that there is no correlation between PageRank and CheiRank, and this is a typical characteristic of biological networks: the nodes which are very influential, with many outgoing edges, are usually not the same as those which are tightly regulated and receive many incoming regulations. We know that the PageRank value is usually proportional to the incoming connectivity of a node, but as we see from these graphs there are important deviations from this proportionality law, and for us it was interesting what kinds of patterns in the network lead to deviations from this proportionality rule. We described several characteristic patterns which lead to a node ranking very high in terms of PageRank despite a relatively low incoming connectivity. There are two kinds of stereotypical cases. In the first case, here the size of a node is proportional to its PageRank, so the bigger it is, the higher it is ranked, and the color denotes the ratio between incoming and outgoing edges: if it's red, it means that it receives many more incoming
edges than outgoing edges. You see the kind of pattern where a protein is regulated by many nodes which are already hubs, a kind of hub of hubs: it collects influences from many hubs in the network, many hubs point to this protein. Another example is when a protein is at the end of a relatively long cascade of regulations: each protein in this cascade receives a number of regulations, and somehow all this regulation is channeled into the sink node at the end. This is another pattern. Now, in the opposite direction, when you see a significant deviation from the proportionality rule between the distribution of outgoing edges and CheiRank, you can detect some proteins which are well known, for example TLR4; these are the kind of symbolic names given to proteins because we don't know how else to name them. This protein is known as a major trigger of the immune system, and there is a very large cascade of regulation downstream of this node. By itself this node regulates only two proteins, but it obtains a very high, disproportionately high, CheiRank, because it regulates many functions downstream. Now, this is a slide explaining the reduced Google matrix, which was already explained before. The idea is that we have a global network and some nodes which we are interested in; we have the direct interactions inside this smaller network, but using this formula, and the Gqr component of this formula, we can also obtain the indirect interactions. We decided to use this approach to see how the structure of these hidden connections between nodes in the signaling network changes between cancer and normal conditions. As I have said, we have several types of networks; in this case we analyzed the signaling level and the transcriptional level. As I have said, the transcriptional level we can measure directly, experimentally, so in principle we can take two cells. In this case, we took a normal leukocyte and a B-cell from leukemia. In principle these are two cells of the same type, but one of
these cells behaves like a cancerous cell and the other is a normal cell. Measurements of the transcriptional network in the two types of cells had been made by other people. We assume that the signaling level remains the same, because as a matter of fact the mutations which happen in cancer cells usually do not affect that much the interfaces between proteins; the network of interactions between proteins is usually not rewired that much. So we assume that the signaling network has the same structure, but at the level of transcription there might be many changes: the transcriptional network might be very different between a normal cell and a cancerous cell. This is a kind of multiplex network, if you want: we have the same set of nodes, connected at one level by the interactions from the signaling level and at the other level by the interactions from the transcriptional regulation level. As a result, you can substitute the transcriptional level, study it in the normal cell and in the cancer cell, and see how the structure of the hidden interactions changes. In the end, you can introduce a small nomenclature: you might have hidden interactions which do not change between the normal cell and the cancer cell, hidden interactions which emerge in response to cancer, or interactions which disappear in the oncogenic network. We thought that this is very interesting information to look at. You can also study how PageRank changes between the normal condition and the cancer condition: some of the nodes increase their PageRank, which means that they become more regulated in cancer, and some decrease their PageRank, which means that they become less regulated in cancer. All this is interesting information to look at, but before doing so, we looked at this picture showing the change of CheiRank and PageRank between normal and cancer conditions for the same cell type, related
to leukemia, of course. One of the first observations is that the ratio of CheiRank changes is three orders of magnitude larger than that of PageRank changes, which indicates that in cancer most of the changes happen in the transcriptional network at the level of outgoing edges: some proteins which are potent regulators of multiple biological processes are completely rewired between normal and cancer. In another small study we took a set of genes which are all known to be proliferation genes. These are the direct connections between the genes, and they indicate just one very well-known hub in this network, CDK1; again, I'm not going to explain what this protein does. What you can see is that using only the information about direct interactions, you can infer the existence of this one particular hub, which is very well known. If you use the reduced Google matrix approach, you see, first of all, that the genes involved in proliferation are much more tightly connected to each other, and you see the emergence of other hubs, which is a kind of added value of the reduced Google matrix approach. But what is also interesting is that you see examples of interactions disappearing between normal and cancer, for example in this case. These interactions, which were detected here, show that one protein, known in biology as a mitotic spindle checkpoint regulator, whatever that means, no longer regulates the function of these proteins: this one, this one and this one. As a result, their PageRank decreases, so they are less regulated, which might suggest to you a strategy for improving the situation with this particular cancer type, leukemia, by acting on the regulation of these particular proteins. It doesn't tell you what you should do with these proteins, whether to increase their concentration or decrease it, but at least it indicates that this is a potential intervention point. Okay, so now I switch to the
final topic of my presentation, which will be about Wikipedia. We have already had many talks about Wikipedia, and I must say that for me Wikipedia is a very exciting object, because it somehow represents human knowledge in a graphical form. Of course it is a rather idiosyncratic vision of human knowledge, very incomplete and so on, but it is probably the best formal representation that we have for the moment, and there are many things we can learn from Wikipedia. Despite the fact that only known facts are put into Wikipedia as content, the structure of connections between pages is something that can be studied and used to infer knowledge. One funny example that I found when preparing this presentation: just take a Wikipedia page with some enigmatic title (for some of you it is not enigmatic, of course), say transhumanism, and make three steps from this article in the Wikipedia network, constructing the network of connections in Wikipedia around transhumanism. This will give you a definition of transhumanism: without knowing what transhumanism is, you will learn what it is just by looking at this neighborhood. A funny example, of course; if you already know what transhumanism is, it is not useful. But imagine doing this for proteins: for many proteins we know very little about what they do, so for such objects this approach might be much more interesting to apply. Again, instead of looking at one particular page, you can take a set of pages; in this case I took cat, dog, coyote, rabbit, just a somewhat arbitrary selection. These are the direct links between these pages, but if you look at the hidden links, the indirect connections between two pages, you can immediately learn that these animals have paws and fur, that they are mammals, that some of them are subject to selective breeding, that some of them can be characterized in terms of predatory relations, and that some of
them live in grasslands in North America. So just by looking at the second neighborhood of these nodes, at what connects this group of pages, you can already learn a lot. Our temptation was to use this rationale to understand the relations between proteins, and indeed, within Wikipedia there are approximately 10,000 pages devoted to proteins. This is one of them, the protein filamin A; again, you can hardly guess what it does in the cell just from its name. What is interesting is that at the bottom of the page for such a protein you usually see the known interactions of this protein with other proteins. There is a section for this, which means that the interactions that are already known and characterized are somehow encoded in Wikipedia, usually in an automatic fashion: it is an extraction from existing pathway databases which is automatically inserted into these pages. It is a semi-manual process, so people might of course add interactions by hand, but usually this information comes from existing databases. So if you look at the structure of the direct interactions, the protein-protein interaction network embedded in Wikipedia, it is no surprise that you will see this hairball, which by all topological measures, by all characteristics, is very similar to existing, known reconstructions of protein-protein interactions. In this sense it is interesting that inside Wikipedia you have a protein-protein interaction network, but by itself this network of direct interactions is not very interesting, because most probably it is just a reflection of what was put there from some existing protein-protein interaction network. Nevertheless, we can take advantage of the fact that this network of protein-protein interactions is embedded into the network of human knowledge, and it is interesting to ask whether this brings anything in addition. Our conclusion was that it does. Even if you look at the structure of the hairballs, normalized by density, so we
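As a rough stand-in for the "hidden link" idea just described (the study itself uses the reduced Google matrix; this toy sketch, with invented page names, only counts shared neighbors within a two-step neighborhood), an indirect connection between two seed pages can be scored by how many intermediate pages link them:

```python
from itertools import combinations

# page -> set of pages it links to (invented toy data for the
# cat/dog/coyote/rabbit example from the talk)
links = {
    "cat":    {"mammal", "paw", "fur", "predation"},
    "dog":    {"mammal", "paw", "fur", "selective breeding"},
    "coyote": {"mammal", "predation", "grassland"},
    "rabbit": {"mammal", "fur", "selective breeding", "grassland"},
}

def hidden_links(links):
    """Weight each pair of seed pages by the number of shared neighbors."""
    scores = {}
    for a, b in combinations(sorted(links), 2):
        common = links[a] & links[b]
        if common:
            scores[(a, b)] = len(common)
    return scores

hl = hidden_links(links)
# ("cat", "dog") share mammal, paw and fur, so their hidden link
# carries weight 3 even if the two pages never link to each other.
```

The shared intermediate pages themselves (mammal, paw, fur, grassland, ...) are exactly the labels that make the second neighborhood informative.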
select the strongest hidden links, as many of them as needed so that the density of this hairball is the same as the density of the direct connections. You see that the hairballs are not quite the same, and you can already see the existence of certain communities inside this graph. This is interesting, because it means that through indirect links these proteins are connected not in the same way as through direct links, and you can demonstrate this more or less explicitly by looking at the degree distribution, which has a particular pattern for the hidden link connections. In this case it is the average clustering coefficient, which is high in the hidden network, and the average connectivity of nodes; you can see that there is a systematic bias towards high connectivity and high clustering in the hidden network. This means that somehow the environment, the neighborhood of proteins in Wikipedia, defines a notion of the function, of the properties, of proteins, and this knowledge might be exploited for defining function. For example, you can cluster your network of hidden interactions, and it will give you more or less well-defined communities, and it is surprisingly easy to put labels on these communities. The largest community is devoted to the immune system; then there are communities devoted to the cell cycle, glucagon metabolism, signaling, potassium ion transport, which is important for brain function, apoptosis, coagulation, keratins, the peroxisome. In principle all these communities are quite natural from the biological point of view, which is not a big surprise, because they are connected indirectly by common knowledge links. Nevertheless, it is very important that this community structure is not explicitly present in the protein-protein interactions themselves: there is an added value in this connectivity inside Wikipedia, which might be used for defining what a biological function is. Also, we studied the difference
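The density-matching step mentioned above can be sketched as follows (a toy example with invented protein names, not the study's code): keep only the strongest hidden links until the hidden network has as many edges as the direct one, then compare average clustering coefficients on equal footing:

```python
def top_k_edges(weighted_edges, k):
    """Keep the k strongest weighted edges (hidden links)."""
    ranked = sorted(weighted_edges.items(), key=lambda kv: -kv[1])
    return [pair for pair, _ in ranked[:k]]

def avg_clustering(edges, nodes):
    """Average local clustering coefficient of an undirected graph."""
    nbrs = {v: set() for v in nodes}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    total = 0.0
    for v in nodes:
        k = len(nbrs[v])
        if k < 2:
            continue
        # count edges among the neighbors of v (closed triangles)
        tri = sum(1 for a, b in edges if a in nbrs[v] and b in nbrs[v])
        total += 2.0 * tri / (k * (k - 1))
    return total / len(nodes)

direct = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]      # a chain: 3 edges
hidden = {("p1", "p2"): 9, ("p2", "p3"): 8, ("p1", "p3"): 7,
          ("p3", "p4"): 2, ("p1", "p4"): 1}
nodes = ["p1", "p2", "p3", "p4"]
matched = top_k_edges(hidden, k=len(direct))             # same edge count
# The chain has no triangles, while the density-matched hidden
# network closes one, so its average clustering is higher.
```

This is the kind of systematic bias (higher clustering at equal density) that distinguishes the hidden network from the direct one.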
between the 2013 version of Wikipedia and the 2017 version. We found that most of the communities match quite well, though we had some surprises: for example, in 2017 we discovered a new kind of super-community which connected proteins that were not at all, or only very loosely, connected in the previous version of Wikipedia. So we tried to understand the nature of this largest community. Out of curiosity we also looked at how we can automatically annotate these communities. For example, let us take this coagulation-related community. Again, this is the list of proteins involved in it, and if you have never studied biology these names might be completely meaningless to you; they give you no a priori reference. But if you look at the second-order neighborhood which connects these proteins, you see all these labels which appear from Wikipedia. First of all, you can find the scheme of coagulation, a kind of textbook image of coagulation which connects all these factors into a cascade. You can see the diseases associated with this community. You can even find the person who was the most visible person studying coagulation. You can learn that these coagulation factors are produced in the liver, you can see which drugs are used to treat coagulation problems, and so on. This is by itself an interesting complement to what we can learn about the structure of these communities. Another example is this community in the network; again, if you did not study biology, the names of the proteins are completely without reference, but by this analysis you can learn that it is related to cell migration, and that the cell migrates through extension of filopodia and changes of cell polarity. And what about this super-community which appeared in the 2017 version? It was funny, because the first observation you make is that in the center of this community the most central node is myoglobin, which might
indicate that between 2013 and 2017 there was a major discovery in terms of myoglobin function. As a matter of fact, it looks more like an artifact, because if you look at what explains these hidden connections (this is the network of hidden connections, and they point to myoglobin), and at what stands between these proteins and myoglobin, you see basically two pages: one page is called protein-protein interaction, the other page is called protein. Apparently, over these four years between 2013 and 2017, a lot of content was added to these two pages, so they grew in size by an order of magnitude; they have many more links, and many more proteins are now connected to these two pages. And on both pages myoglobin is presented as an example, as the first protein whose structure was solved by crystallography. This explains the emergence of this super-community, which combines proteins without any obvious common biological function. Okay, my conclusions. First, network propagation is a powerful tool which allows us to jointly analyze the data that we have today in cancer research, and to introduce into this analysis our knowledge in a way that is free from a very rigid definition of biological function. Second, fortunately or unfortunately, it seems that in most applications the first neighbors already give you enough information to make predictions and to use these biological networks. I also highlighted the notion of creative elements: those elements which are not that tightly connected in biological networks but nevertheless obtain a high rank, and this makes these nodes interesting. And as for this story of the Wikipedia protein network and hidden communities, I think there is an added value in looking into Wikipedia in order to understand what the consensus definition of a biological function is in the community, because Wikipedia is the result of collective work, and you could see there the emergence of these
communities inside Wikipedia devoted to particular biological functions. These are the people who were involved in this work, including people present here; this is a waterfall which is called Hell waterfall; and these are other people from our group who made some pieces of the work that I have presented. So thank you very much, I am ready to answer your questions.

Question: As you worked with Jean-Philippe Vert, do you use kernel methods on graphs?

Answer: Not in the projects that I mentioned, but of course, one of the famous tools that he introduced into computational biology is indeed graph-based kernels, which you can immediately insert into a classifier, or into PCA or whatever dimension reduction technique, and indeed sometimes that is very constructive.

Question: The concern I have is that most of this information is in the form of a network, that is, in the form of pairwise interactions, while in some cases the natural objects in biology, like regulation, like the simultaneous regulation of a gene, or protein complexes, involve more than two things. Do you think there is a sense in which these approaches might be blind to such higher-order interactions, and is there a sense in moving towards higher-order interactions?

Answer: Indeed, behind these higher-order interactions there might be complex patterns of wiring. We exploit such higher-order interactions in the construction of so-called logical models, where we try to introduce logical rules on how a combination of inputs leads to a change of the function of downstream proteins; there you explicitly take into account this combinatorial nature of interactions. Here, to what extent we use it, probably we don't, but maybe you can complement this. Other questions?

Question: Actually, it is very similar to your question. The network of chemical reactions is essentially a bipartite graph, but with hyperedges, directed in a complex way, so for me it is not very easy to understand how in principle it is
possible to approach this Google PageRank idea for such networks.

Answer: Indeed, we do not apply it directly to stoichiometric networks, to these bipartite networks. But the approach that I mentioned, flux balance analysis, for example, deals directly with the stoichiometric matrix, which takes into account this bipartite nature of the graphs, and which can also be used to highlight some central nodes in chemical reaction networks. So directly it is not applicable, but once again, in theory we have this chemical kinetics formalism, bipartite graphs and so on; if you look at what information we actually have at large scale, it is usually formalized in the form of these simple influence graphs, stating that protein A is causally related to the function of protein B. We still lack a lot of details behind these regulations.

Okay, so thank you to our last speaker, and probably it is time to thank also the organizers of this lecture. Thank you.
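As a small aside on this last exchange (a toy illustration with invented species and reactions, not material from the talk): a bipartite reaction network maps directly onto a stoichiometric matrix S, which is the object flux balance analysis works with, constraining fluxes v by the steady-state condition S v = 0:

```python
# Toy bipartite reaction network -> stoichiometric matrix.
# Species, reactions and coefficients are invented.

species = ["A", "B", "C"]
# reaction -> {species: signed stoichiometric coefficient}
reactions = {
    "r1": {"A": -1, "B": +1},   # A -> B
    "r2": {"B": -1, "C": +1},   # B -> C
    "r3": {"C": -1, "A": +1},   # C -> A (closes the cycle)
}

def stoichiometric_matrix(species, reactions):
    """Rows = species, columns = reactions (sorted by name)."""
    cols = sorted(reactions)
    return [[reactions[r].get(s, 0) for r in cols] for s in species]

S = stoichiometric_matrix(species, reactions)

def is_steady_state(S, v):
    """Check S @ v == 0: every species is balanced by the fluxes v."""
    return all(sum(row[j] * v[j] for j in range(len(v))) == 0
               for row in S)

# A uniform flux around the cycle balances every species.
```

Flux balance analysis then optimizes a linear objective over v subject to S v = 0 and flux bounds; here only the matrix construction and the balance check are sketched.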