Good morning. My name is Amrit Pralaj. I am from the National Institute of Technology in India; it is in the eastern part of the country. As a Khorana scholar, I have been working here in Robert Nowak's lab, in the ECE department in Engineering Hall. The title of my project is "Efficient Structure Learning of Biological Networks."

First, I will show you a picture of a biological network. This is what a biological network looks like. It is a genetic interaction network of yeast. It is drawn in such a way that genes with similar functions are clustered together, and the different colors here correspond to different functions. The importance is that, from these clusters, we can say which gene has which function, so it gives a hierarchy of biological functions. And from chemical-genetic interactions, we can learn which chemical creates hypersensitivity in which part of the map, that is, in which cluster. That leads to the identification of targets for a drug or chemical, and hence to drug discovery.

So the question is how to build this biological network. I will discuss the method that is used currently, and later I will present a more efficient way to do it.

The first step of the current method is to measure the interaction value between each pair of genes. That is done with an experiment called SGA, Synthetic Genetic Array. Let's say we have two cells, A and B. Each undergoes a gene knockout, like this and this, so they become single mutants. Then they are allowed to mate, double mutants are formed, and the correct double mutant is isolated. Now let's say this is AB, this is A, and this is B. The fitness values S_AB, S_A, and S_B are measured and put into this formula. The result gives the interaction value between A and B. If that interaction value is above a threshold, it is considered a significant interaction.
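The formula on the slide is not captured in this transcript. As a minimal sketch only, assuming the multiplicative model commonly used in SGA analyses (double-mutant fitness minus the product of the two single-mutant fitnesses), the scoring and thresholding step might look like this. The function names, the fitness values, and the threshold are all hypothetical, and comparing the magnitude of the score against the threshold is also an assumption.

```python
def interaction_score(s_ab, s_a, s_b):
    """Assumed multiplicative model: observed double-mutant fitness minus
    the product of the two single-mutant fitnesses."""
    return s_ab - s_a * s_b


def is_significant(score, threshold):
    """Assumption: significance is judged on the magnitude of the score."""
    return abs(score) > threshold


# Hypothetical fitness values, chosen only for illustration.
score = interaction_score(s_ab=0.25, s_a=0.5, s_b=1.0)
print(score, is_significant(score, threshold=0.1))
```

A score near zero means the double mutant grows about as well as the two single mutants predict, so no interaction is recorded for that pair.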
These interaction values build up the interaction profile, or feature vector, for each gene. Let's say one gene was tested against ten other genes. For simplicity I have shown a binary interaction profile: where the gene interacts with one of the ten other genes there is a one, and where there is no interaction there is a zero.

The database on which the current experiments were done is an online genetic interaction database called DRYGIN. The genes taken from it [count unclear] were split into two sets, one called the query set and the other the array set [set sizes unclear]. It looks somewhat like this. The query set is stored in rows, as shown here, and the array set is stored in columns. These entries are the interaction values between a query gene and an array gene. So for each query gene, its row is its interaction profile, and for each array gene, its column is its interaction profile. NaN means "not a number"; it means there is no interaction between that query gene and that array gene. The numbers show the interactions.

Step two is to find the similarity between genes using correlation. Let's say this is the interaction profile of gene one, and this is the interaction profile of gene two. To find the correlation, one multiplies the corresponding values, which gives this, and then adds them up. Genes that are highly correlated are more similar, so they should be together in a cluster. Genes with low correlation should be further apart, because they have no similarities.

So the next step is to group all the similar genes together and to move the genes that are not similar further apart. That is done using clustering techniques. These light boxes show genes that are clustered together. These are the clusters.
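As a rough sketch of the similarity step, the snippet below multiplies corresponding profile entries and adds them up, treating NaN as "no interaction" (zero), as described above. It also shows the Pearson correlation, which is the same sum after centering and normalizing. The helper names and the toy profiles are illustrative assumptions, not taken from the project's MATLAB code.

```python
import math


def clean(profile):
    """Replace NaN (no measured interaction) with 0.0."""
    return [0.0 if math.isnan(v) else v for v in profile]


def inner_product(p, q):
    """Multiply corresponding entries and add them up."""
    return sum(a * b for a, b in zip(clean(p), clean(q)))


def pearson(p, q):
    """Pearson correlation: the centered, normalized inner product."""
    p, q = clean(p), clean(q)
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq) if sp and sq else 0.0


nan = float("nan")
gene1 = [1.0, nan, 0.0, 1.0, nan]   # toy binary profiles
gene2 = [1.0, 1.0, nan, 1.0, nan]
gene3 = [nan, 1.0, 1.0, nan, nan]

print(inner_product(gene1, gene2))  # shared interactions with gene2
print(inner_product(gene1, gene3))  # no shared interactions with gene3
print(pearson(gene1, gene2), pearson(gene1, gene3))
```

In this toy data, gene1 and gene2 share two interaction partners and would be placed close together, while gene1 and gene3 share none.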
These dark boxes show the parts with low correlation, and these are the highly correlated parts. So from here we get the clusters, and those clusters are then brought together to form a functional map, as was shown in the first picture.

The problem we face in the current method is that, to build these interaction profiles or this type of genetic network map, exhaustive measurements have to be done, so a lot of experiments are needed. To build up the interaction profile of each query gene, it has to be tested against every array gene. With the numbers of query and array genes in this data set [counts unclear], the total number of experiments that need to be done is around 6 million.

One way to make it efficient is this. Evidence from previous studies shows that biological processes are sparse. This means that a gene never interacts with all of the other genes; it interacts with genes of similar function, which is why they later cluster together. So if this is the interaction profile of a gene, it is sparse and has just 2 nonzero elements. For the next gene, it is then sufficient to measure its interaction profile only at these positions. We don't need to do all the other experiments and get those values, because ultimately, when we compute the similarity between these 2 genes, the values are multiplied, and wherever one of them is zero the product cancels. So there is no need to do all these experiments.

That is just between 2 genes. In general, for a whole data set, the algorithm is this. Say we have n query genes and m array genes. We randomly pick some query genes; let's say we pick p out of n. For these p query genes, we do all the exhaustive measurements to get their complete interaction profiles. Then we take the union of these interaction profiles.
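The two-gene argument above can be checked with a small sketch: if gene 2 is measured only at the positions where gene 1 is nonzero, the multiply-and-add similarity comes out the same as with exhaustive measurement. The profiles and helper names below are made up for illustration.

```python
def support(profile):
    """Indices where the profile has a nonzero (significant) interaction."""
    return {i for i, v in enumerate(profile) if v != 0}


def full_inner_product(p, q):
    """Similarity using every position of both profiles."""
    return sum(a * b for a, b in zip(p, q))


def restricted_inner_product(p, q_measured):
    """q_measured maps index -> value for the positions actually measured."""
    return sum(p[i] * v for i, v in q_measured.items())


gene1 = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]       # sparse: two nonzero entries
gene2_true = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]  # what exhaustive measurement would give

# Measure gene 2 only where gene 1 is nonzero.
gene2_measured = {i: gene2_true[i] for i in support(gene1)}

print(full_inner_product(gene1, gene2_true))
print(restricted_inner_product(gene1, gene2_measured))
print(len(gene2_measured), "experiments instead of", len(gene2_true))
```

This equality holds exactly for the plain multiply-and-add similarity. A centered correlation also depends on each profile's mean, so there the restricted measurement is an approximation, which is consistent with the correlation values shifting slightly in the results while the clusters stay the same.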
So this is the union. It is the OR function applied across the feature vectors. From here we get all the positions that are nonzero, that is, that have significant interactions. For the remaining n minus p query genes, there is no need to do all the experiments. It is sufficient to measure only at these positions because, as I said, the values will ultimately be multiplied, and where there is a zero they cancel. So for each of the n minus p genes we save some experiments. Here, for example, we save 6 experiments.

These are the simulation methods. It was done in MATLAB. Say we have 400 query genes and 64 array genes; the code picks some of the query genes and some of the array genes. It records the sparsity range of the query genes and of the array genes, on a scale of 0 to 1, and it also records the mean sparsity, again on a scale of 0 to 1. Each of these experiments was run 100 times and the results were averaged.

Now I will speak about the Rand index. The Rand index is on a scale of 0 to 1. It measures the similarity between the clustering from the previous method and the clustering from this method. It looks at how many pairs of genes are in the same cluster there and also in the same cluster here, and how many pairs that are in different clusters there are also in different clusters here. So on a scale of 0 to 1, it says how similar the two clusterings are.

From here we see that as we move toward the total number of query genes and the total number of array genes, which is here, the Rand index tends to 1, because then the clusters in the two methods are almost the same. But even if you pick very few query genes, the Rand index is still very high, around 0.97.

This is a picture from the original method, in which exhaustive measurements were done. These are the clusters. And this is the current efficient method.
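The selection procedure (pick p of n query genes, measure them exhaustively, take the union of their nonzero positions, then measure the remaining genes only at those positions) and the Rand index used to compare clusterings can be sketched as below. This is an illustrative Python sketch, not the MATLAB code from the project. The function names, the `truth` matrix standing in for lab measurements, and the toy data are all assumptions.

```python
import random
from itertools import combinations


def efficient_measure(truth, p, seed=0):
    """truth[i][j] is the interaction of query i with array j (0 = none).
    Returns (estimated matrix, number of experiments performed)."""
    n, m = len(truth), len(truth[0])
    rng = random.Random(seed)
    pilot = set(rng.sample(range(n), p))          # p randomly chosen query genes
    # Union (logical OR) of the nonzero positions of the pilot profiles.
    union = sorted({j for i in pilot for j in range(m) if truth[i][j] != 0})
    estimate, experiments = [], 0
    for i in range(n):
        cols = range(m) if i in pilot else union  # exhaustive vs. restricted
        row = [0] * m
        for j in cols:
            row[j] = truth[i][j]                  # stands in for one lab experiment
            experiments += 1
        estimate.append(row)
    return estimate, experiments


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree
    (same cluster in both, or different clusters in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


# Toy data: 4 query genes that all interact with the same 2 of 6 array genes.
truth = [[1, 0, 0, 1, 0, 0] for _ in range(4)]
estimate, used = efficient_measure(truth, p=1)
print(used, "experiments instead of", len(truth) * len(truth[0]))
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, different label names
```

Counting experiments this way gives p·m for the exhaustively measured genes plus (n − p) times the size of the union for the rest, so the saving depends on how sparse the union is.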
Here the correlation values change, which is why the colors change, but the sense of the clustering remains the same. That means the genes that had high correlations still have relatively high correlations. You can see the clusters here, and most of the elements are in the correct clusters. This is the picture from the original exhaustive measurements. If we rearrange it in the clustering order given by our efficient method, we get a picture like this. You can see that most of the clusters are correct, as it looks very similar.

This method can be applied not only to this query and array gene problem. It can be applied to any general problem in which there is sparsity in the process, and anywhere correlation measurements are used to find similarities. For unknown biological databases, the average sparsity of each gene can be obtained from expert biologists, and then we can proceed to use this method.

For future work, I would like to build a mathematical model to quantify how many genes to pick. From a very naive calculation, it seemed to be on the order of the log of the total number of elements. In that case, the number of experiments in the exhaustive method is m times n, where m and n are the numbers of array and query genes, and in the efficient method it is this. But we have to check further whether that is true or not. So that is the direction of the future work.

Acknowledgments: Prof. Robert Nowak, who is sitting right there; Gautam, my co-guide, who is a grad student in the lab; everyone in the program; the Department of Biotechnology, Government of India; the Indo-U.S. Science and Technology Forum; and all my foreign friends.