Good afternoon everyone. It's a pleasure to open up a discussion on a very exciting area that is really powering biological science today. Our topic is how deep learning techniques are used in genomic research. You'll also get a walkthrough of the basics and fundamentals of genomics, which will be followed by a hands-on session. Aware of what's happening in the world of genomics, why do you think genomics is important? There's something we would like to highlight before I begin the session. Genomics is a buzzword that all biologists, mathematicians and computational biologists today use extensively, and analyzing this data has really made an impact on both therapy and diagnosis. The field has been revolutionized over the past 10 years or so, mainly since the concept of next-generation sequencing came in. What is this genomics actually? The buzz started somewhere in 2001, when the first draft of the human genome was released, with further additions in 2003. That's when we figured out the first great mystery: the nucleus of every cell holds DNA that is one to one and a half meters long and made up of 3.3 billion base pairs. It holds a lot of information, so much that decoding what this information means to us is a huge challenge. That challenge drew in computational biologists, a hybrid of computer scientists with a lot of mathematics expertise, and that's when algorithms and the computation of physicochemical properties came into biology. Today we are at a very advanced stage in understanding these techniques. But first, let's understand what a genome is before we even use the buzzword. A genome is just a representative word for the entire DNA that a cell holds. If you look at a eukaryotic cell, which is what you and I are made of, it has a membrane and a nucleus, and the nucleus contains the entire DNA that we talk about.
We think that the entire information that we hold is encoded in the DNA. The entire structure of DNA, those 3.3 billion base pairs, is what the buzzword genome refers to. If I extract the protein information instead, I call it a proteome. We will later talk about something called the exome in this session. Now, how does this large genome fit? There is a very big size problem: the nucleus is just 10 microns across, and we are trying to put in a 1.5-meter-long DNA. The DNA first base-pairs into a double-stranded helix. It is a polymer with a phosphodiester backbone made up of sugar and phosphate, with the inner lining composed of 4 different chemical structures. These are the 4 letters that we are made up of: A, T, G and C, for adenine, thymine, guanine and cytosine. It is a beautiful double-helical structure, and an asymmetrical one, as you see; an anti-parallel arrangement, very geometrically and beautifully arranged, which then wraps onto certain proteins called histones, forms a nucleosome, compacts itself, and makes its way into the nucleus. How this makes it into the nucleus and how it is regulated is the most exquisite machinery that nature has designed. Man has not made a machine to match this exquisite design to this day. So the genome, coming to today's problem, is made up of A, T, G and C, and we are going to call these bases. How do these bases sit? The A and the T can be connected by hydrogen bonds, and so can the G and C. The connection between A and T, or between G and C, is what I call base pairing: A pairs with T through two hydrogen bonds, and G pairs with C through three, a stronger pairing, so they are not easily broken. That is the unit by which we are going to measure the genome. Later we can convert it into nanometer, angstrom or meter scales.
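The base-pairing rule just described (A with T, G with C, on two anti-parallel strands) can be sketched in a few lines of Python; the function name and the example sequence are ours, just for illustration.

```python
# Sketch of Watson-Crick base pairing: A<->T, G<->C.
# Because the two strands run anti-parallel (5'->3' vs 3'->5'),
# the partner strand is read as the *reverse* complement.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the anti-parallel partner of a 5'->3' DNA strand."""
    return "".join(PAIR[base] for base in reversed(strand))

print(reverse_complement("ATGC"))  # partner strand, read 5'->3'
```

Applying the function twice returns the original strand, which is exactly the symmetry of the double helix.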
But the unit as I walk along the genome is always the base pair. That is why when you ask me the length of the genome, it's always stated as 3.3 billion base pairs. I could then convert it into nanometers when I do the amino acid calculations. Now, how is this regulated? There are multiple steps by which this can be regulated. I will come a little later to why we are using deep learning for this. For now, I would request that you remember two terms: the five prime and the three prime end. When I refer to the five prime end, I refer to a phosphate group; when I refer to the three prime end, I refer to a hydroxyl group. The strands come together and do the base pairing: adenine always pairs with T, and G always pairs with C. This is how the arrangement is made, and you see that one strand runs from three prime to five prime while the other runs from five prime to three prime, in a complementary fashion, the two stacking together in the beautiful geometry the molecule can build. As I make my way through, I am talking about a single stretch of DNA, and the whole world knows it as double-stranded DNA. Today we even talk about four-stranded DNA, and wonder DNAs from which we are able to make materials as well; we will come to this during the discussion part. So this DNA is not a static material. We can call it the building block, the blueprint of information, and this is exactly where information is packed: all the information which helps us function, which drives every biochemical reaction in the cell, which decides cellular pathways, which decides development, diseases, well-being, character. But how does information flow? The DNA cannot be functional explicitly like this. It has to undergo two different processes, which together are beautifully called the central dogma of life.
How does DNA act without being read directly? Through ribonucleic acid. An RNA is a ribonucleic acid. DNA first makes an mRNA through a process called transcription; it is transcribed, and then translated to make a protein, the actual functional unit. Transcription happens much the way the word suggests. The DNA, as we know, is double-stranded. There are many types of RNA: a messenger RNA, which carries the message; a transfer RNA, which transfers amino acids and helps in the synthesis of protein. Different RNAs function at different stages. Just so you can follow the workshop, we are laying the fundamentals here. Transcription is the copying of a DNA stretch into an mRNA stretch. RNA differs only in that the T is replaced by U, uracil. From the transcript, the cell starts synthesizing the proteins required for function inside the cell; that process is what you call translation. The DNA-to-mRNA step is effected by an enzyme called RNA polymerase. The RNA-to-protein step is enabled by the ribosome. I do not know how many of you remember that Venki Ramakrishnan's Nobel Prize was for the crystal structure of the ribosome complex, work at the MRC in Cambridge, where they spent several decades to arrive at that complex structure; even now it is not fully resolved. From there, if you look at it, a gene is just a stretch of DNA. As transcription starts walking across it, it encounters two sites: one called a promoter and the other a terminator. The machinery should first know where to start, the promoter, which initiates transcription, and where to end, the terminator. As I walk through, I pass through exonic regions, which are coding regions, and regions which do not code for protein in the mRNA. The latter are called introns; as we know, these are interrupting regions.
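The T-to-U replacement during transcription can be sketched directly. One hedge: real transcription reads the template strand; this shortcut uses the coding strand, which spells the same sequence as the mRNA except for T versus U. The function name is ours.

```python
# Sketch of transcription, read off the coding strand: in the mRNA,
# thymine (T) is replaced by uracil (U); A, G, C stay the same.
def transcribe(coding_strand: str) -> str:
    """Return the mRNA spelled from a 5'->3' coding-strand DNA sequence."""
    return coding_strand.replace("T", "U")

print(transcribe("ATGGCT"))  # the start codon ATG becomes AUG
```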
And the introns need to be spliced out, because if read through they are not conserved; they introduce gross mistakes into the protein, which is a root cause of disease. So these have to be removed by a mechanism called splicing, and this happens in the nucleus. The picture then moves into the cytoplasm, where, with the ribosome and the tRNA, the pre-mRNA that has been spliced and fully modified is read, and the protein is actually made in the cytoplasm. Jyoti will continue on what codons are and related things. Before that, the question is: why would you ever talk about this in connection with deep learning? As I sequence this long stretch, I can sequence it at the level where transcription has happened or at the level of DNA. In doing so, there is large heterogeneity; the data is huge and quite complex because of that heterogeneity, and traditional mathematical models and algorithmic tools cannot deal with it. The mathematical modeling is not something you can quickly learn from, because many of these sequencing studies are based on cohorts of 10,000 or 20,000 people. So there is a large heterogeneity of data. If there is big data in any field, it is in genomics; it is a very, very data-driven science. Because of the complexity of the data, pairwise correlations are not easy, and statistical visualizations are not easy either. Therefore predictive models are to be used, and the best of the predictive models capturing this today are deep learning models, mainly because the abstraction of mathematical modeling feeds naturally on this genomic data. After Jyoti finishes her part, we'll come back and explain more about where deep learning is applied here. Thank you. Thanks a lot, Dr. Vijayalakshmi. So what you have just heard are the basics. Many of the details may not have sunk in if you're new to biological sciences, but many things you could capture.
So you know that the DNA we all hold is a kind of a puzzle within a puzzle within a puzzle. The whole idea is: if you want to bring this data into a numerical format so that you can do deep learning and so on, you need a dictionary for it. So we are trying to condense what you have learned till now: how do we transform this into a statistical data set? What is your classical data set? You have a number of rows and a number of columns, and your data needs to be in that shape. If you hand a machine learning model the raw sequence, it won't understand; it has to be rightly translated. So all of us carry a dictionary. Our genome, the complete DNA present in your body, is the complete genetic makeup of a person. And we have a dictionary for reading it, called the DNA codon table. But don't mistake that the same dictionary holds good for all other organisms, for microorganisms and so on; they may use a different dictionary. For eukaryotes, we carry one single dictionary, and it can be seen in this diagram, which has to be read from the inside out. For example, take a simple sequence. A protein always starts with ATG; as she said, transcription has to start somewhere and stop somewhere, so there is always a termination codon at the end, and the start is always ATG. ATG means methionine; if you just read it off the table, ATG is methionine. Our proteins are permutations and combinations of 20 amino acids; the basic building block of a protein is the amino acid. For example, take an insulin molecule: written out in amino acid fashion, it looks like a long string of letters. But you have to take it into your columns. How do you take it into columns? You count: 30 glycines, 44 alanines, and so on. So you get your variables. For each sample, you are getting your X variables, your independent variables, from the composition of the 20 amino acids.
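The dictionary idea just described can be sketched in code. Heavy hedging: the three-entry table below is a tiny illustrative subset of the real 64-codon table, and both function names are ours, not from the session.

```python
from collections import Counter

# Tiny illustrative subset of the DNA codon table (the real one has 64 entries).
CODON_TABLE = {"ATG": "M", "GGC": "G", "GCT": "A", "TAA": "*"}  # * = stop

def translate(dna: str) -> str:
    """Read the sequence three letters at a time and look each codon up."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "?")
        if aa == "*":          # termination codon: stop reading
            break
        protein.append(aa)
    return "".join(protein)

def composition(protein: str) -> Counter:
    """Amino-acid counts: the independent variables for one sample."""
    return Counter(protein)

protein = translate("ATGGGCGCTTAA")
print(protein, dict(composition(protein)))
```

The `composition` counts are exactly the "30 glycines, 44 alanines" columns mentioned above: one count per amino acid, per sample.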
I'll talk more about this later, but I just wanted to show how we connect biological information to your computing or modeling environment. Have you heard of a genetic algorithm? It is inspired by evolution, by what has happened to our race, the human race. We and the chimpanzee evolved from a common ancestor, and over the years, with each generation, the gene pool changes; we are evolving, and we evolved into what we are today. Our evolution is still going on, so we don't know where it is going to take us. But you can actually mirror what happens in evolution in this algorithm. I don't want to spend too much time here. The one thing to note: we all have 23 pairs of chromosomes, our genetic composition, and in each pair one chromosome comes from your father and the other from your mother; they talk to each other. And see what happens in your lifetime: you are exposed to the environment, to many things. You start your life with your base genome, and then whatever you encounter gets encoded in your genome in the form of mutations. Different things drive mutations. Every cell has a shelf life, and there is a lot of wear and tear in your body, so every time a cell gets replaced it has to duplicate, it has to copy itself. During the copy, there is a possibility of errors; whenever you transfer information from one form to another, there is always a possibility of error. So throughout your lifetime your cells are dividing, and at every instance there is a possibility that errors are introduced. These errors can be of different types, different classes, and they are called mutations. Mutations can be deletions, in the sense that some part of your genome gets deleted.
The deletion can be of small, medium or big length; you don't know, it varies. Duplication is another possibility: during the copy, the same information is copied many times. That too is a mutation, an error, because it is not exactly the reference copy we started with. Inversion: a stretch of the gene gets inverted. And consider, for example, chromosome 4 and chromosome 20: imagine a bit of chromosome 4 goes and gets attached to chromosome 20. This insertion is interchromosomal, not just intrachromosomal. Then there is translocation, as I was saying, where a bit of one chromosome goes and sits elsewhere. So different kinds of things are happening in your gene dictionary at every point. What is cancer? Cancer is nothing but these errors, taken together, becoming deleterious, reaching a saturation point where they bring a physiological change in your body: a tumor develops. What is a tumor? Nothing but unnatural division of cells. Why does it happen? Who dictates it? These mutations dictate it. Mutations at the codon level can be of three types. Missense: the codon changes so that a different amino acid is inserted, and the protein may no longer make sense; if that protein doesn't work, the physiological activity of that specific protein stops, and your body shows symptoms, which can be of many kinds. Silent mutation: a mutation has occurred but the outcome is nothing; it makes no change to your physiological activity. Those mutations are called silent. Nonsense: very dangerous; in our gene dictionary, a nonsense mutation cuts the protein short, and that defective protein can haphazardly derail your physiological reactions and result in any kind of failure. And what about germline mutations? You are gathering mutations all through your lifetime.
But a mutation gets transmitted from one generation to another only when it occurs in the germline. A germline mutation is one that gets into the next generation: it can be in a sperm cell or an egg cell, and if there is a mutation there, it will probably be carried into your next generation. Let me give you, in the format of a sentence, what happens when a mutation occurs. The top sequence is the normal gene: "As the man saw the dog hit the can." Then a point mutation: "As the man saw the dot hit the can." It will have a consequence in your body. Then a deletion: "As the man saw the hit the can." A word is gone. Then "As the man saw the fat dog hit the can": an insertion, making a different meaning. Now imagine a frameshift. From the starting point of the chromosome to the end, there is a meaning to the whole sequence. Remember the other thing I wanted to bring in: three nucleotides get encoded into one amino acid. If one amino acid's worth is lost, three nucleotides go missing; if two amino acids' worth, six nucleotides go missing. But if the loss is not a multiple of three, your frame gets shifted. What is the meaning of the sentence then, in "As the man saw the..."? Can you read the sentence? No. These kinds of mutations occur day to day; every time you are exposed to something harmful like UV rays, for example, high UV exposure can give you skin cancer. And this is what can happen with a single mutation: cystic fibrosis. Have you heard of cystic fibrosis? Very dangerous. What has changed? One A-T pair has changed to C-G, that's it, a single point mutation, and imagine all the organs going into a state of shock; it causes so much trouble that the person cannot lead a normal life. So how do you detect a mutation? We have understood the types of mutation; let's look at how to detect them.
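The reading-frame idea can be sketched with the same sentence analogy; the helper function and the toy "sentence gene" are ours, purely for illustration.

```python
# A gene is read in non-overlapping triplets (codons). Deleting a single
# letter shifts the reading frame and garbles every codon downstream.
def codons(seq: str) -> list:
    """Split a sequence into complete triplets, dropping any leftover tail."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

normal = "THEFATCATATETHERAT"          # "the fat cat ate the rat"
print(codons(normal))

mutated = normal[:3] + normal[4:]      # delete one letter: a frameshift
print(codons(mutated))                 # every codon after the deletion shifts
```

Note that deleting three letters (one whole codon) would remove a word but leave the rest of the sentence readable; it is the non-multiple-of-three loss that destroys the frame.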
So we have packages like Biopython where you can actually compare sequences, understand where a mutation has occurred, and classify it. Let's look at pairwise sequences. There are a lot of bioinformatics tools for this, and you can also do it in Python using Biopython. Imagine you are matching two sequences: what are you looking for? You're looking for a match value. A match, one nucleotide matching the reference, is good; that's what we want. A mismatch: G should always go with C, but what has happened there? A T. It's a mismatch, and your model should be able to catch it. And what about a gap? There should be a G or a C, but there is nothing; it is missing, the nucleotide has dropped out. So when you're doing a pairwise alignment, you look for these three: match, a good point; mismatch, not a good point; gap, very bad. Do you understand why? Because our genes are read in triplets, and any letter dropped from a triplet is bad. So we apply a penalty. If you remember gradient descent, with its local and global minima, the same idea appears here: you have global alignment and local alignment, and you have to come up with a score. So you take your list of mutations and build independent variables: how many matches, how many mismatches, how much gap penalty. You can also add another independent variable describing the type of mutation, and then you compare against reference sequences. You'll ask where you get the reference sequence. There is a reference sequence. Have you heard of the Human Genome Project, which went on for about 20 years to map one single genome? Now we are in such a state that we can do one whole genome in 24 hours; when we started, we were in the state where it took those 20 years to map one genome.
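Counting matches, mismatches and gaps over an already-aligned pair can be sketched like this. This is a toy scorer of ours, not Biopython's scoring function; the weights are illustrative, and `-` marks a gap in the alignment.

```python
# Score an aligned sequence pair column by column.
# Illustrative weights: match rewarded, mismatch penalised, gap penalised most.
def score_alignment(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Return (matches, mismatches, gaps, total score) for two aligned strings."""
    assert len(a) == len(b), "aligned sequences must have equal length"
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total

print(score_alignment("ACGTG", "ACCT-"))
```

The three counts are exactly the independent variables described above: matches, mismatches, and gap penalty, per sequence pair.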
But the evolution of genomic research and its applications has gone to such an extent that you can give your blood in the morning and get your genome: your genetic material is easy to obtain. Just take 5 ml of blood; your DNA is extracted from it, amplified by one of several techniques, and then mapped. Within 24 hours you get a gene card covering all your 23 chromosome pairs, a snapshot of your state at that point. Your body is always changing: if you do the gene scan for yourself today, it won't be the same in ten days or a year, because your body is changing. So do you understand now how we can convert what we have done in sequence alignment into a matrix, which forms the basis for your deep learning? As I said, pairwise alignment pairs only two sequences at a time. Can you do more at a time? Yes, multiple: you can collect blood from 50 candidates and look for mutations in a single gene; you want to see who has a mutation and who doesn't, all at the same time. That is multiple sequence alignment. We have some of the tools mentioned here, Clustal Omega and others, which can do it; they are bioinformatics tools, not your modeling tools, but they give you your independent variables. There is also something called T-Coffee: whether you have a protein molecule, an RNA molecule or a DNA molecule, it does multiple alignment and gives you the scores, which become your independent variables. Today in the hands-on, we are going to look at one example where we have taken 50 DNA sequences of 50-base length, with a label. The label was obtained using an assay, so we already have labeled data. You take the labeled data, run your CNN or RNN, get a confusion matrix, evaluate based on that, and you have made a classifier. For that gene, you have taken 50 samples and made a classifier; when you have the next 50 samples, you don't have to label them, right?
Because your classifier is already ready, built on the 50. So you don't have to do the assay again; taking this as a model, you can label the next 50. Don't you think that reduces your time, energy and resources? That is why we need these classifiers. Have you heard of the BRCA gene? Has anybody heard of the BRCA gene? It's a breast cancer gene, and breast cancer is among the first cancers we can detect easily today. For many cancers, early detection is very difficult; the problem is that cancer occurs in different variations. Breast cancer itself can have many variants, so it's not easy to map one cancer type to one mutation; the mutations can be many. For example, take two people with breast cancer and map the same BRCA gene in both: the mutations can be quite different. We all differ by variation. How much variation do you think? Each one of you differs from the next person by only about one percent. We all look so different, but the change is only about one percent; and that one percent, you don't know where it sits. We differ from each other by roughly one percent, but that variation differs from person to person. So do you understand how much variation our population carries, in so many different forms? That's why our personalities are so different; everything is reflected in what we are built on. So this is how you translate codon or genomic information into a machine learning data set. You have not yet done it yourself, but this is a very simple case. Usha will now walk through the notebooks and how we have done it, so if you have any questions after the notebooks, we'll talk about them. Okay, thank you. Yeah, hi, good afternoon everyone. Thanks for coming. For the next 30 to 45 minutes, we'll have a hands-on coding session. But first, I would like to introduce myself with a few introductory words.
It's a great privilege to present along with two people who have achieved a lot in genomics, Dr. Jyoti ma'am and Dr. Vijayalakshmi Mahadevan ma'am. I would like to thank Dr. Jyoti ma'am for coaching me throughout the preparation, and Dr. Vijaya ma'am has been inspiring me to achieve more and push boundaries, and she has been guiding me throughout my career, in all walks. So thank you, both of you. I'm a principal data scientist with a consulting group, and I'm also the chapter lead for Bangalore Women in Machine Learning and Data Science. So, there are three parts to the hands-on. First, Dr. Jyoti has told you about sequence alignment, match, mismatch and all that; that will be the first part. In the second part, we have taken a simulated dataset of sequences of 50 bases each, just for the sake of understanding, and we'll be finding out whether each 50-base sequence will bind to a protein or not. If it binds to a protein, it is labeled one; if it does not bind, it is marked as zero. This will be the second hands-on. The third hands-on is an assignment in cancer genomics; we'll come back to it. So, yeah, getting excited? This is the demo I'm going to take: a sample DNA text file, on which I'm going to do both global alignment and local alignment. The idea behind the score is this: it takes the maximum similarity between sequence X and sequence Y, counting how much of it is matching. Because we only want the maximum similarity, we're not bothered about the mismatches and the gaps here. I'm going to show you how it can be done using a package called BioPython. But the first part of the hands-on is how to convert a DNA sequence into a protein sequence. A sample DNA text file is given; this is your input.
So I'll show you how the sample DNA text file looks. Okay, this is your sample DNA text file. I'm going to take every three letters and encode them into a protein sequence; conversion of the DNA sequence into protein is the first step. This is your sample text file, and I'm going to walk you through the code. I take the input, as I showed you, and read that file. Then this is my codon table; Dr. Jyoti has explained the codon table to you. It is like a blueprint: for every three characters, what do they map onto? Using this codon table, I generate a protein sequence, and I print out the sample sequence and the protein sequence generated for it. So this is the first part of the notebook. Then I'm installing a package called BioPython. She has shown you sequence alignment, global alignment and local alignment; in BioPython, we have functions to do it. You just have to pass the sequences, and the function does the job for us. From Bio you import pairwise2, and there is a method called align.globalxx: you pass the X and Y sequences to it, and it automatically gives you the score for the alignments it finds. So you call the method, pass the X and Y sequences, and the job is done. For local alignment, it is align.localxx: again you pass the X and Y sequences and you get the score. So this is the first part of the hands-on. Let me quickly summarize what has been done in this notebook: you take a sample DNA text file, then convert it to a protein sequence, as I did first. Then, as she explained in the slides, you calculate the global alignment and the local alignment.
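Biopython's `pairwise2.align.globalxx` gives +1 per match with no mismatch or gap penalty; under that scoring, the optimal global score equals the length of the longest common subsequence, which we can sketch with plain dynamic programming. This is our stand-in, assuming Biopython may not be installed:

```python
# Stand-in for pairwise2.align.globalxx scoring: +1 per match,
# no mismatch or gap penalty. The optimal score is then exactly the
# longest-common-subsequence length, computed here by dynamic programming.
def globalxx_score(x: str, y: str) -> int:
    """Best global alignment score under match=1, mismatch=0, gap=0."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, 1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

print(globalxx_score("ACCGT", "ACG"))
```

With Biopython installed, `from Bio import pairwise2` and `pairwise2.align.globalxx(x, y)` return alignments whose `.score` matches this count.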
She had put up the formulas and given the theoretical explanation for it; we use the BioPython package, pass the X and Y sequences as the input parameters, and get the score as the output. So that is the walkthrough of the first notebook. For the second, the introduction she gave in the last slide explains the hands-on we're going to do: we are taking a simulated data set of DNA sequences of 50 bases. I'll show you how the input looks now. This is the simulated data set of 50-base sequences. What I have to do is classify whether each will bind to a protein or not: if it binds to a protein, I classify it as 1, and if it does not bind, it is classified as 0. That is the job, and I'm going to use a basic CNN architecture, a deep learning architecture, to get it done. There are four steps. First, we curate the data we have. Then we select an architecture; the one we are showing is, for the sake of simplicity and to accommodate all types of audiences, a simple model with simple hyperparameters. The code will be shared with you; you are free to experiment with all kinds of hyperparameters and do all the complex optimizations on top of it. We're just giving you a simple walkthrough. So: first getting the data, then choosing an architecture, then training and evaluating, and then interpreting. These are the four steps. I print out the first five lines of my data; this is how it looks. The first step is to convert the labels into integers. Once the sequence is in integer form, I make a one-hot encoding of it: for each of the nucleotides, A, C, T, G, you have to make a one-hot vector. So there are two steps to it.
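Those two steps, integer-encode and then one-hot, can be sketched in plain Python; this is our minimal version of what the notebook's preprocessing block does, with an arbitrary letter-to-integer mapping.

```python
# Step 1: map each nucleotide to an integer (the mapping order is arbitrary).
# Step 2: expand each integer into a one-hot vector of length 4.
MAPPING = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> list:
    """Turn 'ACGT...' into a (len(seq) x 4) list of one-hot rows."""
    rows = []
    for base in seq:
        row = [0, 0, 0, 0]
        row[MAPPING[base]] = 1   # mark only the position of this nucleotide
        rows.append(row)
    return rows

print(one_hot("ACT"))
```

A 50-base sequence therefore becomes a 50 x 4 matrix: exactly the input shape the convolutional layer expects.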
So this block of code first converts the sequence into a sequence of integers, and then converts that into a one-hot encoding. I'm printing out a sample of how it looks: yes, A, C, T, G. I hope everyone understands one-hot encoding: wherever A is present, the A position is marked as one and the rest are marked as zero, and you do the same for each of the nucleotides; the position of the value that is present is marked one and everything else zero. So: first convert into integers, then do the one-hot encoding. This is the format we want the data in, and the pre-processing goes up to this step. Next you select the architecture. As I already told you, this is a very simple CNN architecture, chosen to accommodate all of you. There are four kinds of layers in it. The first is a convolutional layer, a Conv1D layer, with 32 filters of size 12 bases. Then we have a MaxPooling1D layer; we're just using it for non-linear down-sampling, and it's an optional layer. Then a Flatten layer flattens the result of the MaxPooling1D layer. And then we have two dense layers: the first dense layer converts the flattened output into 16 units, and the second converts that into the two possible outputs. The output is zero or one, right? If the sequence binds to a protein it is one; if it does not bind, it is zero. So this is the architecture: a simple Conv1D, then MaxPooling1D providing the non-linear down-sampling, then a Flatten layer for flattening whatever comes out of the max pooling, and then the two dense layers, first 16 units and then the final output of zero or one. A very simple Conv1D architecture, which I'm going to use for the problem. So I train my model.
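To see how the 50 x 4 one-hot input flows through those layers, we can compute the tensor shapes by hand, assuming the usual Keras defaults: 'valid' padding, stride 1, pool size 2. The helper function is ours; the notebook builds the real Keras model instead.

```python
# Shape bookkeeping for: Conv1D(32 filters, width 12) -> MaxPooling1D(2)
# -> Flatten -> Dense(16) -> Dense(2), applied to a (50, 4) one-hot input.
def layer_shapes(seq_len=50, filters=32, width=12, pool=2):
    conv = (seq_len - width + 1, filters)   # 'valid' convolution, stride 1
    pooled = (conv[0] // pool, filters)     # non-overlapping max pooling
    flat = pooled[0] * pooled[1]            # flatten to one long vector
    return {"conv1d": conv, "maxpool": pooled, "flatten": flat,
            "dense1": 16, "dense2": 2}

for name, shape in layer_shapes().items():
    print(name, shape)
```

So each 50-base sequence becomes a 39 x 32 feature map, is pooled down, flattened, and squeezed through 16 units to the final two-way bind/no-bind output.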
I give the model my training features and training labels, and then I print out and calculate the precision and recall values. This is my confusion matrix: how much is correctly predicted and how much is misclassified; you can get those insights from the confusion matrix. So this is the end of the first notebook. Basically, the overview of this entire notebook is: I take DNA sequences of 50 bases and try to classify whether each will bind to a protein or not, using a simple 1D convolutional network. For this, I have to convert the sequence into a format I can feed to the convolutional layer: it starts as letters, so I convert it into integers, and after that I apply a one-hot encoder. Then I have the format in which it can be fed as input to the architecture we pre-designed, and the output of the architecture is 0 or 1. So it's a very simple task; with Dr. Jyoti having explained the biology, we have done it in deep learning. Now we'll go to the second example. The problem statement: a cancer tumor, whatever type of cancer it is, breast cancer, skin cancer, any type, will be caused by several kinds of genetic mutations. So whenever you have a mutation, there is some genetic variation which has caused it; I think both Dr. Jyoti and Vijayalakshmi ma'am have explained the theory of mutation. For a clinician to manually go through the literature and identify it would take a long time. We can do it much faster with deep learning.
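The precision and recall values read off a binary confusion matrix follow directly from its cells; here is a small helper of ours with the standard definitions.

```python
# For a binary classifier the confusion matrix has four cells:
# TP (true positive), FP (false positive), FN (false negative), TN.
# precision = TP / (TP + FP): of everything predicted "binds", how much was right.
# recall    = TP / (TP + FN): of everything that truly binds, how much we found.
def precision_recall(tp: int, fp: int, fn: int):
    """Derive precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# e.g. 8 true binders found, 2 false alarms, 2 binders missed
print(precision_recall(tp=8, fp=2, fn=2))
```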
They have to go and read through different texts and then classify the mutation into a class from 1 to 9, where 9 is for the deleterious mutations and, as you go down the order, 1, 2, 3 are the not-so-dangerous ones. For a clinician to do this manually is a lot of work; we can do it very easily with machine learning. So I'm going to show you what the input is and what output we are getting. We have two data sets. Genomic data is always huge, so we have taken a very small subset of it for you to be able to follow the hands-on; this is not what it generally looks like. First I'll show you how the input data looks. We have two data sets which we have to combine. This particular CSV file has three columns: the gene, the variation, and the class. Class is the target variable, from 1 to 9; 7, 8, 9 are the deleterious classes and 1, 2, 3 are not that dangerous. Our job is to predict this class. So this is the input: you've got the gene, the variation, and the class, and class is going to be your target variable. Then there is one more sheet: for each ID you have text available, something which will give you more insight into what type of mutation it is, to classify it as dangerous or non-dangerous. So you have a text attached to each ID. These are two different data sets, and our first job is to merge them. ID is the common variable between the two. I'll just show the data sets again: this is my first data set, with the gene, variation, and class, and ID is the common variable between the two data sets, so for each ID there is a corresponding descriptive text as well.
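The merge step can be sketched with pandas. The tiny frames below are made-up stand-ins for the two files; the real gene and variation values come from the actual CSVs.

```python
import pandas as pd

# Made-up stand-ins for the two data sets; ID is the common key
variants = pd.DataFrame({"ID": [0, 1, 2],
                         "Gene": ["BRCA1", "TP53", "EGFR"],
                         "Variation": ["W509R", "R175H", "L858R"],
                         "Class": [1, 7, 9]})
texts = pd.DataFrame({"ID": [0, 1, 2],
                      "Text": ["abstract one ...", "case report ...", "review ..."]})

# Merge on ID so every variant row carries its descriptive text
merged = pd.merge(variants, texts, on="ID", how="inner")

# Move the target column Class to the end
merged = merged[[c for c in merged.columns if c != "Class"] + ["Class"]]
print(merged.columns.tolist())  # ['ID', 'Gene', 'Variation', 'Text', 'Class']
```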
The metrics we are going to use are the multi-class log loss and the confusion matrix. We are going to compare the performance of different machine learning algorithms using these two metrics. Any time you run a machine learning algorithm, you have to be clear on what the metric is, what input you're giving, and what output you're getting. So I import all the necessary packages for the process I'm going to execute. Then I read the training data and print the first five rows of the first data set which I showed you, and then the first five rows of the second. I merge the two data sets on the ID, so the corresponding rows are now combined. I'm making sure I move the class to the end, because that's my target variable. Then I'm dividing my data set: roughly 60% for training, 20% for validation, and 20% for final testing. To avoid overfitting, I divide into three sets: training, validation, and test. That's what this piece of code does. One more caution: when you divide the data, you have to check the distributions of the various classes. You have classes from one to nine, right? The distribution across all nine classes should be similar in all three sets you have kept aside: training, validation, and test. That is something you have to be careful about while you split. So I'm going to visualize it and show you that in the split I did, all three distributions are similar. This is the distribution of class labels in the training data, classes one to nine.
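A minimal sketch of the 60/20/20 split with scikit-learn: the `stratify=` argument is what keeps the nine class distributions similar across the three sets. The toy labels below simply give each class 30 examples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.repeat(np.arange(1, 10), 30)    # toy labels: 30 examples of each class 1..9
X = np.arange(len(y)).reshape(-1, 1)   # toy feature column

# First hold out 20% as the final test set, stratified on the class label
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Then split the remaining 80% as 75/25 -> 60% train, 20% validation overall
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(y_train), len(y_val), len(y_test))  # 162 54 54
```

Because the split is stratified, each of the nine classes ends up with the same proportion in training, validation, and test, which is exactly the check discussed above.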
The dangerous ones, eight and nine, have fewer examples, and the highest count is for class seven. So this is my distribution of class labels in the training data. We have a similar distribution of class labels in the validation data as well: eight and nine have the lower numbers and seven is the highest. And we check the distribution of class labels in the test data set too. What we conclude is this: we looked at the distribution of labels in training, validation, and test, and found that all three data sets have a similar distribution, so now we are good to go. This is a check where you need to be careful and ensure the split is proper. Before you start testing different machine learning models, there is an optional step: you can check how good a random prediction is. You shouldn't have a model which performs worse than a random prediction, right? So if you want, you can test how good a random model is. It's optional, so I'm skipping this part. Then a little bit of exploratory analysis: how many unique genes we have and what the categories are, so that we understand the data set much better. We are going to do a univariate analysis of the gene feature. This is exploratory analysis for you to get a closer connection with your data set before we jump to the actual machine learning model, okay? The two important things we are going to do: a univariate analysis on the gene feature, and, in a later part, an analysis of the text data we have. What this shows is that the number of unique genes is 231, and among the different categories, BRCA1, for example, is associated with breast cancer, and you can see how often each one is actually present.
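The optional random-baseline check can be sketched in a few lines: a uniform guess over 9 classes has a multi-class log loss of ln 9 ≈ 2.197, so any trained model should come in well below that number.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.repeat(np.arange(9), 10)        # toy labels covering all 9 classes
probs = np.full((len(y_true), 9), 1 / 9)    # "random model": uniform probability everywhere
baseline = log_loss(y_true, probs)
print(round(baseline, 3))                    # ln(9) ≈ 2.197
```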
So this is what you get to know from the training data. Then you plot a histogram to visualize what we just saw: the number of occurrences for each of the 231 genes, with the gene index going up to 231 on one axis and the number of occurrences on the other. This is just exploratory analysis before you get to the model. You also do a cumulative plot of it; that's it. There are two quantities we will use for comparison: alpha and log loss. We should pick the alpha with the lowest corresponding log loss value. Now, featurizing the gene: there are two steps to it. First you do response coding and then one-hot encoding. You have to check that the context is not lost after the one-hot encoding; sometimes you might lose certain context. That is what I'm checking in this entire piece of code: whether I've lost any context of the data after doing the one-hot encoding. First comes the response coding; then I print out the distribution of the features and their sizes, then vectorize and print out the shape and the details. We have not lost the context. I print out the alpha and log loss. That is the univariate analysis of the gene feature. We will also do a univariate analysis of the text feature. For the gene feature we did response coding and one-hot encoding; for the text feature we do response coding and TF-IDF. After you do response coding and TF-IDF, you again plot alpha versus log loss, and you also have to ensure that the distribution of word frequencies across the three data sets is similar.
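The TF-IDF part of the text featurization can be sketched with scikit-learn; it turns each clinical text into a weighted word-count vector. The three documents below are made up for illustration (the real corpus is the text column of the merged data set), and response coding, a custom per-class encoding, is not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the per-ID clinical texts
docs = ["missense mutation in the kinase domain",
        "truncating mutation likely deleterious",
        "benign polymorphism in the kinase domain"]

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(docs)   # sparse matrix: one row per document
print(X_text.shape)                        # (3, number of unique words)
```

Words that appear in every document (like "mutation" here in two of three) get down-weighted relative to rarer, more discriminative words, which is why TF-IDF is preferred over raw counts for this kind of text.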
That also is one of the caution checks you have to do, and this part of the code does it. And now you finally come to building the machine learning models. There are several models we tried out for this use case, and I'll show you a quick summary of them. The metrics we use to compare the models I told you at the beginning: the multi-class log loss and the confusion matrix. These are the two metrics we used to compare across the several machine learning models we tested. Which model in our analysis performed better at this job? Which is it? Yes, for the text data I'm doing TF-IDF on it: you convert it to a numeric format on which you can do the analysis, so you can feed the text data into the model. Yes, we're using the text data as features as well; the text data doesn't have null labels at all. There is a particular part I've skipped: there is a bit more explanation to be done on the text-data processing, which I've skipped for the sake of time. We'll be sharing the code, so you'll be able to see the different processing steps we applied to the text data; after the TF-IDF there is a longer part involved as well. From our analysis we found that the stacking classifier has the lowest log loss and performs better than the other models. So for this use case we took a large data set and tried out various machine learning algorithms. Our job was to identify which of the class labels each sample falls under; that was our target variable, and across the several machine learning algorithms we tried, the stacking classifier performed best. So, to recap: this was the second use case we dealt with, and we have had three walkthroughs of code.
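A stacking classifier trains several base models and then a meta-model on top of their predictions. This is a minimal scikit-learn sketch on synthetic data; the real pipeline stacked different models on the gene/variation/text features and compared them by multi-class log loss.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic multi-class data standing in for the featurized mutation data
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Base learners feed their predictions into a logistic-regression meta-learner
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_tr, y_tr)
proba = stack.predict_proba(X_te)
print(round(log_loss(y_te, proba), 3))  # the metric used to rank the candidate models
```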
The first was to convert a DNA sequence into a protein sequence, followed by pairwise sequence alignment, both global and local. The second use case was: given a sample DNA file, predict whether each sequence binds to a protein or not; if it binds, the label is one. That is a simple classification using a simple 1D convolutional architecture. And the third use case was personalized cancer diagnosis. Here the input was this: if you have a cancer tumor, several mutations would have caused it, and for each of the mutations there is a genetic variation. A clinician has to manually identify which particular mutation the genetic variation has come from, so we are simplifying the clinician's process with a machine learning algorithm; the clinician doesn't have to manually do all the searching and reading of the documents and text. That process is simplified by applying machine learning algorithms. This was the input, and the output target variable was our class labels, 1 to 9, where 7, 8, 9 are the dangerous ones and, down the order, 1, 2, 3 are not that dangerous. So those are the three case studies and code walkthroughs. We will be sharing the code. It is a very short time to grasp all of it, so we will share the code; you can go home, read through it, and if you still have doubts you can always get back to me. That is the end of the hands-on, and over to Dr. Jyoti; there is a little bit of the theory part left, so I will hand over to her. Thank you, Ushaan. The whole idea is that you understood what we do in genomic research.
If you want to use a deep learning network or any kind of deep learning modeling, we now know how to translate biological information into numerical information and work with it. But do you know why we are doing this? Is it so necessary? The whole idea is the why. Have you heard of a term called precision medicine? Yes? The problem is that whatever we are doing at present is not working. In the case of cancer itself, we have different types of cancer, and different types of cancer have different natures, so we are still unable to classify cancers properly. That means whatever we are doing is not very effective, and in the coming days we have to take a different strategy. What is the different strategy? Till now, whatever happens to you, you go to the doctor, he or she diagnoses you, and you take the medicine. But what we have seen is that that treatment is not effective. The alternative is personalized medicine, or precision medicine. What is precision medicine? It is nothing but taking data from all possible sources.
Your case history, whatever has happened to you from your childhood till now; your biochemical tests, all kinds of biochemical tests, urine tests, blood samples, thyroid checks; take all that information. If you have broken bones, you would have taken an X-ray; take that X-ray. Put all of them together, plus your genomic data. So what happens? When this data goes to a doctor, the doctor can make an informed decision. He is looking at your clinical history, he knows what has happened physiologically, and he is adding genomic data, genomic knowledge, to it. So he comes up with a medicine which is so personalized that it will be effective; it is a better treatment. So the whole reason we want to convert genomic understanding into machine learning is that everything has to move towards precision medicine. Tomorrow, for everyone, as I told you, your genome can be mapped within 24 hours; genomic sequencing technology has come to such a place that you can map your genome within 24 hours. Do you know how much it costs to map your genome? Anybody have any idea? A hundred dollars; it has come down to a hundred dollars. So very soon it is going to be so affordable that you can map your genome, and you will carry your genome in your pocket. It will be like a gene card: you go to a doctor, everything is stored in your card, the doctor puts it in the system, reads it, and is able to easily diagnose whatever trouble you come with, because he or she has all of this at their fingertips, everything that has been collected, everything that has happened to you. Don't you think that is the right way to treat? Let me give you an example. My father had a chronic heart attack, so the doctors gave blood thinners; that is what happens when you have a blood clot in your heart, they give blood thinners. My father was in that one percent of the population who react adversely to that blood thinner, and he could not be saved. But if this technology or
this knowledge had been there, maybe he would not have been lost. The whole idea is: how do we smoothen and streamline our treatment strategies, the way we diagnose, the way we detect disease? Early detection of cancer is what we are striving for today. At present we don't detect cancer very early; we detect it at the last stage, or at a stage where the person can no longer be treated. We want to get to a place of early detection. There are a lot of marker genes; if you just study those genes throughout your lifetime, you can say who has a propensity towards cancer, who has a propensity to some other disease. The whole idea is that we want to revolutionize the way treatment happens, and this is the way forward. And why are we doing all this at this juncture? Because our genome sequencing technology has come to a stage where you can collect the data, and data science needs a lot of data, a lot of variation. Each of us carries so much variation, and we are accumulating variation every day through whatever activity we do. So we have to map our genomes, and this is the way forward. And genomic data is very complex; as I told you, what we have spoken about is only one percent of what we have learnt. There is so much complexity to our DNA. Our data sets are broad, not tall; they are multi-class and multi-label. So what happens when this comes into the hands of a molecular biologist or a clinician who cannot read it because there are too many columns? Until now we used bioinformatics tools, but with the flood of data coming, bioinformatics alone will very soon not be enough: it is slow, the person has to spend a lot of time, and it is tedious. So how do you reduce the workload of that clinician, and how do you make things faster, so that the information is at hand computationally and you get the benefit out of it? That is why we
are doing all this, and it is high time that we bring in computational expertise, the brains to put into this. For example, I am a molecular biologist; for the past two years I have been doing data science. I am not that strong when it comes to modeling and all that, but I am very good in my domain, so if somebody supplements my domain expertise, it is much faster and smoother. The whole idea is that this is the time to do this. I hand over to Vijay Lakshmi; she can speak about what is going to happen and how, in future, AI is going to contribute to genomic research. Thank you, Usha and Jyoti, for the perspective you laid out on deep learning and the exercises you took the participants through. I would like to conclude with a few layers of work that the different sectors in genomics have been addressing, and why they necessitate deep learning. I can talk from our experience of doing genomics day in and day out at the level of the clinic, and also the difficulties in the analysis of the huge data sets that we generate. So you sequence a cancer genome, or the genome of a neurodegenerative patient, a neuropsychiatric patient, or a patient with a rare disease: it is a human genome at the end of the day, and after filtering, after removing the noise, you still end up with at least 30 GB of data, a good volume of data. Now, within that data, what are the actual challenges people face, and where does it invite machine learning experts, particularly the ones with deep learning? We will look at it in the case of biological and clinical interpretations, and again in the case of healthcare interpretations. The different types of sequencing that happen today: when I want to sequence the entire genome, I do something called whole genome sequencing. If I just want to look at changes in expression at the gene level, I do something called mRNA sequencing, because, as we saw, we look only at the transcribed genes and look
at the portion that is coding. When you want to look particularly at the influence of the environment, say in a big project that is going on with children of alcoholics, how much of that stress transfers into the genome, or in IT employees, how much stress is encoded in the genome: these are environmental responses, so I look at the regions where proteins bind to DNA, pull those regions alone, and sequence them. If I want to do hereditary studies, as in psychiatry: is it in the family? Shall we take three generations and siblings and sequence them? Then I look at the coding, exonic regions (we saw exons and introns) and do something called exome sequencing. Today we are moving ahead to the microbiome, the microbes in the gut: sequence the microbiome and look at whether they have an effect on brain disorders. So look at the multiple levels of sequencing that we do, and the multiple sizes of data that we end up with. In all these things, why do we actually require deep learning, and where are we today? We do this at the level of bulk cells: I take a tumor, collect all the cells from the tumor together, and sequence. But there is a heterogeneity: even when a pathologist cuts and gives you a tumor, not all the cells are cancer cells; there are non-cancer cells there too, so there is an intra-tumoral heterogeneity. The result I get is a statistical bias depending on how much of the sample really represented the tumor. So the population is heterogeneous; will the mathematical models, algorithms, and statistical definitions that I have been using so far in genome analysis be useful at all? No. Can I look at the situation and predict models based on the genome? I do not want a pre-biased model. Today, if I were to predict a transcription factor binding site or something, I use position weight matrices, I use HMMs, the hidden Markov models, and we use Bayesian algorithms significantly. But can
something be model-based, something that can develop and learn on the basis of the complex data set I am giving it, beyond what I can predict? Now, as we move to the single-cell level, from cell to cell there is a huge variation, so I have to adapt to these variations, and deep learning methods predict these very easily. Dr. Jyoti was talking about precision medicine; it not only supports the doctor in diagnosis or therapy. Like what she said about the clot: many people with lung cancer are usually given a drug called cetuximab, which does not work in people who carry a mutation in KRAS. That is why the mutation detection exercise was given: for the detection of specific variants that a person carries. More particularly, consider scanning a specific population: today we are moving towards population-genomics-based studies. If you look 10 years back, the studies were on Andamanese tribes, Alaskan populations, different tribes in different parts of the world, to see how their genomes differ. Now we are taking them in the clinical perspective also, because there is a heterogeneous mix of populations: has somebody carried mutations over to other areas? If you look at Mizoram today, the highest percentage of gastric cancer is recorded in those areas; the understanding is that there is condensed tobacco usage. Now every house will start telling such stories, but where are we actually in using these, particularly when we are doing population-based studies in India? There are very big studies: NIMHANS and NCBS have taken up a huge study on brain disorders, on schizophrenia patients, bipolar disorder patients, and patients with obsessive-compulsive disorder. Because the brain cannot be sampled invasively, and for a psychiatric patient imaging does not differ in a way that lets you identify something, exomes are sequenced from the peripheral blood mononuclear cells. They choose families where at least three of
them have this disease, so there is an inherent familial history. Can I use the familial history information to predict specific variants that are prevalent in India? Because some patients respond very well; in such cases there is a property of responders versus non-responders, and whether someone will respond based on their genomic profile is being indicated. That's a very interesting project. Or look at something the Tata Trusts have initiated with UCSD and partners all over India: the main problem in India they have taken up is malaria. Malaria is now much under control in pockets, but everybody knows it is about the mosquitoes that bite. So do we have to go and kill all the mosquitoes, or can one re-engineer the genomes so that they are refractory to carrying the parasite? This is what genome-editing people are working on. This is the Anopheles mosquito, which transmits it, right? So people are now taking pockets of mosquitoes across different parts of India, sequencing them, getting their draft genomes, and asking what is specific about the Indian mosquito that you should target. When we want to edit a genome, there are very many genome-editing techniques that have come, the CRISPR-Cas9 strategies; one can edit, knock out, knock in, put something in, alter the genome. But I should know what the effect of this knock-out or knock-in will be, and that is where, at the single-cell level, RNNs are heavily employed, because they can easily pick up the heterogeneity, while CNNs, as I said, are used for smaller stretches of sequence. There are other cases, in schizophrenia-based studies, where deep learning is used when the onset is detected at a very early stage. I have a collaborator at Exeter in the UK, Prof.
Jonathan Mill, who uses this for the identification of variants across twins: of two children born, one has schizophrenia and one doesn't, and the measurement happens from the 30th day of the child's birth. Huge projects that are happening are about sequencing newborn babies; that will be the future. Harvard has taken up something called SEQaBOO, "Sequencing a Baby for an Optimal Outcome", covering 44,000 children, where one can detect abnormalities and syndromes which would show up at a later stage but are inherent in the genome, hearing disorders and the like; they have migrated from the traditional Bayesian and other classifications. So particularly in population genomics, where we start sequencing huge populations, deep learning is very heavily applied. But where does it actually impact healthcare in the future? One prescribes a drug, and we do not know what the outcome of a particular therapy on a patient will be. If you look at the chemotherapy cycle, after two cycles the patient stops responding, because a drug resistance sets in; there are several patients who do not respond. So pharmacogenomic approaches invest a lot in deep learning. These are where the future of healthcare, as well as biological findings and related fields, is moving. I would like to finish with one particular case: there was a set of people from a liquor company here, and something exciting came up four days back on how black liquor is converted to pulp, where they have taken up a huge amount of microbial sequencing; I thought I would end there. But today's genomics is not only about genome sequencing: it is about proteomic data plus genomic data, NLP on the prescriptions given, video processing, text processing, and audio processing; particularly in neurodegenerative diseases, the patient is captured and the audio and the video are recorded. So there is processing of multi-omic
data at the biological level, medical observations, plus imaging. This is where deep learning comes in: to interpret them exactly as a multi-dimensional data structure used in diagnosis and healthcare. So we might get into using more precise algorithms to arrive at this. As long as the human body remains complex, technologies will continue to evolve. Thank you very much. For example, to release somebody's genomic data out in the public, you have to have the consent, right? At present the universities and research institutes work with a consent saying that they do research. So there is the whole question of privacy; and the thing is, you cannot patent genes or genomic sequences, right? So there is still a gap in how much to release to the public. There are so many things; the policy is not yet in place. So what is released to the public is bits and pieces, but you have a lot of data available in open source, and you can start working on it. Once you have shown that you have the right kind of model, you can always talk to the university and they will eventually release it, but that will take time. You have seen that outside there is the 1000 Genomes project; there are so many places; Kaggle has some data sets. No, as I told you, the policy is not in place and there are questions of privacy and confidentiality. Major research institutes like CDFD, CCMB and others are doing a lot of work, but nothing is released beyond the paper that comes out of it, so you do not have easy access to the Indian genome. A huge effort has been initiated in India to release genomic data from the Indian population. One effort that has already started is to sequence the typical Indian genome: Genome India, covering 10,000 individuals, is now being taken up by 22 institutes, and this will be out in the public for everyone. But for the disease type of data that you asked about: when you publish, the current rule is that you have to deposit the raw data
and then publish. So there are consortia: for schizophrenia-like data there is something called CommonMind for brain disorder data. There are consortia to which you need to write and explain why you need the data. It is a simple thing: you say that it is not-for-profit, that you want to work on this experiment, and that you will not request the patients' private data; then you have access. I work on such data; I submitted a request and got access. The moment you spot a typical Indian genome database, you can always write to the consortium; it does not have to be in India. If it is published, it goes to a consortium. Look at ENCODE, the Encyclopedia of DNA Elements, and others, which have quite a lot of data, including from Indian subsets, because today we cannot publish without depositing the data. From the cancer side also, in TCGA, all that we need is to get the regulatory approvals for access. It is becoming very easy; the repositories are coming through, and even AI-based data repositories are being put up; a national mission is being developed there. That is very, very important, because we as researchers feel the immense need for such data sets, which are not generally available. So I go and make a request and get data from the consortia; at least the consortia exist today, which was not the case two or three years ago. It will be interesting work, but you do not even know whether the data is from a male population or a female population, or who the responders are; now we are getting such data. Yours is a very valuable suggestion; the government appears to be listening now, so let's see. One more thing: many of these institutes have good repositories. For example CCMB, which is in Hyderabad, has looked at different tribes, different populations, people of different origins; they have taken samples and they have the whole genome sequences, but it is with the institute, and it is not going to give it
to the public until and unless there is a big reason. The first thing is privacy. There is a difference between an image and a gene: if you take an X-ray of somebody, the image will look much the same for everyone, but, as I spoke about, each of us differs from every other person by one percent, and that one percent is different with reference to other people. There is so much variation in the gene pool; the gene pool is nothing but, at a given time point, all the genes in the whole world population, and the gene pool keeps changing at every point. So the whole thing is, they are not going to release it easily. As she said, there is a possibility that we can request it, but again, the data will be in a format that is not easy to convert for your modeling, so you will need a domain expert to do that. We have a lot of data, but it is not easily available. You can always start with the 1000 Genomes data; there is so much available in bits and pieces, and NCBI also has gene-wise data, but you won't find the whole genome. So at least the raw material is there to start with. We need the right models; if the right models are there, people are going to release the data. But it is not like other fields where you get data easily, like finance or banking; this biological information is something else. Data is oil, yes, and genomic data is yours, right? The big question is whom the gene belongs to. You may carry the gene, but if I have taken your blood and I have sequenced it, does it belong to me now? There are a lot of ethical questions which are very unclear, so it is going to take some time for everything to be out in the public, and even if it is released, it will be released in bits and pieces; you won't get the complete data. Even for a baby, if you need to take blood, you should have an audio or video consent, so
the regulatory authorities are very tough on those things. But the UK has a model: when a patient walks into the clinic and signs an informed consent for a biological material, they obtain the consent saying this material could also be used for other studies, but will not be released. So a cardiologist uses it for a cardiogenomics project while a hypertension study uses it as well, though the blood taken will be at most 10 mL from these patients. Those regulatory approvals are very much in place, but the privacy of the data, as you said, is yet to be encoded and distributed; that infrastructure doesn't exist yet, but very soon we will find it in place. And that is not the only way to get at this, because the data science community is very much required to crack this and get deeper into it. People are now taking the data behind publications in Nature and Cell and revisiting them using deep learning approaches, to see if some more information can be extracted from them. So that will open up a huge opportunity, not very far from now.