 the, I guess, the first talk, and it's going to be a slightly offbeat talk compared with what you folks are commonly used to. It's going to be about data, but data used in a very different way, in a very different application from what you would normally have seen. This is about health care, of course, and more specifically about health issues that result from the genome, as opposed to something that comes in through exposure to the environment. So let me introduce the area and say what we are talking about. We are talking about genomic disease. So what is the genome? We all know what the genome is. We inherit the genome from our parents. It's three billion characters, times two, because we get one copy from each of our parents. So there are six billion characters in all. These six billion characters together define how we look and how our bodies behave in many different ways. Many of these characters are quite variable from person to person. At a particular place you might have an A while I might have a G, and so on. Clearly those characters don't matter as much: even if you change one from an A to a G, it doesn't matter much. But there are a few, and by few I mean maybe a few hundred thousand of these characters out of the billions, that are less forgiving, in the sense that if you change them then something bad happens. And unfortunately the wrong value in these characters does show up. Because they are so serious it doesn't show up very often, but it does show up often enough for us to sit up and take note. Roughly one in fifty people or so is affected by these sorts of characters. I'll give you examples a little later. And something that's even more familiar to us is cancer. One in ten people will get cancer in their lifetimes. That too is a genomic disease, caused by characters in the genome turning bad in some way.
So the question we are asking, the business problem to solve, is: given a patient, how do we figure out which character in the genome, out of the six billion or so characters, is responsible for the condition of that patient? So that's the problem statement. A few examples now. You've probably all seen this; it made big news a couple of years ago when it came out. Angelina Jolie elected to have proactive surgery to remove her ovaries and her breasts. Why she did that is because the baseline risk of breast and ovarian cancer is about one in eight and one in seventy respectively, if you have no other qualifying criteria. But if you have mutations in a particular gene called the BRCA1 gene, then your risk increases five to six fold for breast cancer and 27 fold for ovarian cancer. So take breast cancer: a one-in-eight risk, multiplied five to six fold. That's something like sixty, seventy, seventy-five percent risk. It's almost a certainty that one would get breast cancer, and at a relatively early age, if one carries mutations in that gene. So some mutations in the gene are serious, and that's what prompted Angelina Jolie to actually take the step of undergoing proactive surgery to reduce her risk of cancer. She knew this because her mother had passed away and her aunt had also suffered from breast cancer, and when they looked for the mutation in the genes they found it, so she knew that she was going to be at very high risk. That's the most well-known example, but there are a number of other examples in our midst and I just want to give you a few. In India, for instance, we've been doing testing on breast cancer patients, and we find that of the people who come to our labs for genomic testing, thirty-five percent of the breast cancer patients have a risk-conferring mutation in a BRCA gene or a related gene. So these inherited risk-conferring mutations are out there in the population. They're not that rare.
Even more interesting, one in twenty-five people is a carrier of a character that causes thalassemia. Thalassemia is a disease where your blood cannot carry enough oxygen, and therefore you have to rely on blood transfusions every few weeks. You cannot do that for very long, though, so your life span gets restricted. So it's quite a serious disease, and one in twenty-five people is a carrier. Luckily they don't have the disease, but they carry the character, and if they have a spouse who also has that character then the child has a one in four chance of having the disease and suffering the consequences. You've seen a lot of blindness around us; you will see these blind schools and people who are blind. A lot of that is genetic: one in four hundred to nine hundred individuals has a disease called retinitis pigmentosa, which causes blindness as a consequence. These are all genetic diseases. One in five hundred individuals has something called hypertrophic cardiomyopathy, which means that the heart muscle has something wrong with it. It's very silent, you'll never realize it, but you'll be at very substantially increased risk of sudden cardiac death if there's a sudden trauma to the heart. Often this shows up in athletes who are playing football: you run into somebody on the field and then suddenly your heart stops. That's because the heart structure has something wrong with it, which places you at much higher risk of sudden cardiac arrest, and you may never realize it unless you actually do medical tests. This is just to show you that all of these diseases are in our midst, and therefore it's important to look at these genomic characters and to have technology and industry that figures out, given a patient, which is the character that caused the problem. So that's the problem statement, as I said, and the reason why we are in this business. Now how do we go about doing this, and what has this got to do with data at all?
This looks like it's all blood and body tissue and biology and science, so what does it have to do with computing, with data? That's the connection that I want to forge over the next few slides, and then tell you why this is not only a data problem but also a very challenging data problem, and unique in several ways. The data aspect comes from the fact that when something is very small and you want to get a window into it, you cannot do it using the conventional tools of physics. You cannot see it, for instance; you cannot use optics to just see it. You have to use some very minute biochemistry and trap things at that level, and the only interface that we have to that level is lots of data. None of our visual or auditory senses will lead us to that level of minute detail. Only data can do that. More specifically, when we have to find out which character in the genome is responsible for a particular patient's troubles, we have to look into the genome. The genome is 6 billion characters. Every cell in our body has the genome. It's minute, it's tiny, you can't even see it under a microscope. You need very, very indirect methods which will allow you to look at it. Those indirect methods have progressed by leaps and bounds in the last 10 years. In fact, there has been a whole revolution where what used to cost several hundreds of millions of dollars has now suddenly become just a couple of thousand dollars. That's made all of this technology applicable to individual patients. That's why I'm here giving this talk. The way the technology works is unfortunately very indirect, and it has to lead us through data. The indirect mechanism is this: imagine taking many, many copies of the genome. Every cell has the genome. You can take many cells, you can take hundreds of thousands. You'll get many copies of the genome for any individual patient. You can think of each genomic copy as a book with lots of characters.
Current technology does not allow us to read that entire sequence and figure out the sequence of characters that you have individually. What it will allow you to do is tear this book into small shreds of paper, and it will tell you what each shred of paper has on it. And now it leaves you with the challenge of taking all these shreds of paper, putting them back together, and saying which character caused this patient's problem. Can you tell that from the shreds of paper that you have? As I said, the genome is about 3 billion characters long. When you take many copies of the genome, say 100 copies, you have 300 billion characters in play at this moment. Those 300 billion characters are torn up into shreds where each shred is just about 100 characters or so. So you get 300 billion divided by 100, which is 3 billion of these shreds. And now you take these 3 billion shreds and start asking which of these billions of characters is the cause of the problem at hand. So that's the problem that we're dealing with, and you can see where the data is coming from. What you get through this technology is, I said 3 billion, but of the order of a billion shreds of paper. We'll call those reads for the purpose of this talk. So when we sequence the genome of a person, we get a billion shreds of paper, or reads, and in total this is about 100 gigabytes of data. All you get is this humongous file of 100 gigabytes from one person, and you now have to make your way through it. The first problem, of course, is that you've torn up the book into pieces, so you don't know which piece came from where. You have to put those pieces together; it's like a jigsaw puzzle. You have to order the reads back together, and that's a challenge.
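The arithmetic above can be written out as a small sketch. The three constants are the round figures quoted in the talk; the function name `read_count` is made up for illustration.

```python
# Round figures quoted in the talk.
GENOME_LENGTH = 3_000_000_000  # characters in one copy of the genome
COPIES = 100                   # copies of the genome that get shredded
READ_LENGTH = 100              # characters per shred (read)


def read_count(genome_length: int, copies: int, read_length: int) -> int:
    """Number of fixed-length reads obtained by shredding `copies` genomes."""
    return genome_length * copies // read_length


total_chars = GENOME_LENGTH * COPIES
reads = read_count(GENOME_LENGTH, COPIES, READ_LENGTH)
print(f"{total_chars:,} characters in play, {reads:,} reads")
```

In practice the talk rounds this down to the order of a billion reads per person.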
But you could argue that's a one-time problem, because any two individuals are not too different in their sequence of characters. Your sequence of characters and my sequence of characters are not hugely different. Of course there are differences, and those differences are what we're after, because the disease in the patient we're looking at arose because one character became different, and that one character is what we're after. But by and large, at the scale of billions of characters, you and I are not very different. We're different in about one in a thousand places. So, given that the amount of difference is small, the act of putting these reads back together, putting the jigsaw puzzle back together, is a one-time challenge, because if you do it for one person, it's a good enough guide to use for the next person, since the similarity is so high. So suppose you have one person's genome, and let's call it the reference genome sequence. Then how would you solve it for every new patient that comes along? Well, you would take every one of the shreds of paper that you get from that patient and search for it in the reference. It's a simple search, just as you do all your text searches, your Google box searches, Lucene searches, etc. You just search for it in the reference sequence. But the reference sequence is 3 billion characters long, so it's not a small piece of text to index. It's a large piece of text that you've indexed, and then you have queries, and how many queries do you have? You have a billion queries. So as you can see, it's not a small problem.
And if you successfully execute all of these billion queries, you'll get a picture like this, where each of these rods you see is a shred of paper that you got from the patient's genome, and you're able to place it at the right place in the reference genome by searching for it. Once you place all of these pieces at the right places, essentially you've solved the jigsaw puzzle. Since you are all largely technology folks, a little bit on efficiency would be of interest. Even if you take one millisecond per search, you have to do a billion searches; that's 200 to 300 hours, roughly. So you can see that this doesn't lend itself easily to just using common tools that are off the shelf. You need fast indexing of the reference sequence. People who've been through the standard algorithms courses would know the various ways to index text: suffix trees and variations. These are expensive space-wise. They take tens of gigabytes of space and lots of pointer jumping, because there's no locality of reference; it's a tree-like structure. You can use other approaches, Lucene-like, hash table-based approaches. There you have to know what to index and what to hash on, so it gets a little more tricky. There's a whole area of data structures called succinct data structures, which allows you to take trees and convert those into arrays so that you have locality of reference. And you also get what I call a very nice time-space trade-off. You have a fixed amount of space, and beyond that, depending on how much time you're willing to give up on the query, your space can be brought down more or less proportionately. Depending on how much RAM you have, you can play all of these time-space trade-off tricks. So it's a very elegant set of data structures. That's what we use to do a billion queries against a billion-length text. But now our problem is that two individuals are not identical. They're different.
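As a toy illustration of indexing the reference once and then answering each read query by binary search, here is a plain suffix array in Python. This is only a sketch: it is not the succinct structure described in the talk, the naive construction would never scale to 3 billion characters, and the names `build_suffix_array` and `find_all` and the sample strings are made up for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction: fine for a toy string, hopeless for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text: str, sa: list[int], query: str) -> list[int]:
    """All positions where `query` occurs exactly, via two binary searches."""
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


reference = "GATTACAGATTACA"          # stand-in for the reference genome
index = build_suffix_array(reference)  # built once, reused for every read
hits = find_all(reference, index, "TTAC")

# The back-of-envelope cost from the talk: a billion searches at 1 ms each.
hours = 1_000_000_000 * 0.001 / 3600
print(hits, round(hours))
```

Each query costs a number of string comparisons logarithmic in the reference length, which is why the index is worth building once and sharing across all reads.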
So when I'm searching against the reference sequence, I have to allow for differences between the search query and the reference. So it's not an exact match. It's an approximate match of sorts. This means working harder on all of these data structures, which we do. I won't go into the details, but let me get further into the application. Once you've solved that problem, once you've taken every one of your billion queries, looked for it in the reference sequence and placed it at the right place, what do you get at the end of that process? You get a picture like this. The scraps of paper are in orange and the reference sequence is in blue. What we're interested in is what is peculiar about you as a patient. Is there a particular character that differs in you as compared to everybody else? Those differences, or variants, are what we're after. And those come out very cleanly here. You can see that if all of the shreds of paper are saying that there's a T in you, while the reference sequence, which represents a healthy individual that, let's say, we've sequenced once, says there's an A there, it means that you are different from the reference sequence at that position. That's a candidate position for us to focus on and see whether it's the cause of the problem at hand. So we call those positions variants. And we have a couple of complications there. One of them is that we don't have just one copy of the genome. We have two, one from each parent, and those two copies could say different things. One could have a T, another could have an A. And you'll get a picture like this, where half your scraps of paper are saying "I have a T" and the other half are saying "I have an A". So one of those copies may be different from what is commonly present in most people, while the other may be the same. We call those homozygous and heterozygous variants, just to keep in mind that you might get two values at every character, not just one.
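The "all reads say T" versus "half say T, half say A" distinction can be sketched as a simple vote over the characters that the placed reads show at one position. This is a hypothetical simplification, not the caller used in the talk: the function name `call_site` and the 20% cut-off (`min_fraction`) are assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def call_site(ref_base: str, read_bases: str, min_fraction: float = 0.2) -> str:
    """Classify one reference position from the bases seen in the reads covering it.

    `min_fraction` is an assumed noise cut-off: a base supported by fewer
    reads than this fraction is treated as a sequencing error and ignored.
    """
    depth = len(read_bases)
    if depth == 0:
        return "no coverage"
    counts = Counter(read_bases)
    alleles = sorted(b for b, n in counts.items() if n / depth >= min_fraction)
    if alleles == [ref_base]:
        return "reference"
    if len(alleles) == 1:
        return "homozygous variant"   # both copies differ from the reference
    return "heterozygous variant"     # the two copies disagree with each other


print(call_site("A", "TTTTTTTT"))  # every read says T
print(call_site("A", "TATATATA"))  # half the reads say T, half say A
print(call_site("A", "AAAAAAAT"))  # a single stray T is treated as noise
```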
So once you've gone through this whole process of doing these billion queries against a billion-length text database, and you've placed all the reads at the right places, the variants just fall out pretty easily. And now the problem is that each person has four or five million variants where he differs from the average healthy person; any two people differ at about four to five million places. Which of these is the culprit? That is where the problem begins. We need to funnel these five million variants down to one or two and say this one is the cause of the problem. That requires a variety of different pieces of information to be brought to bear. Scientific literature is one; the structure of the genome and knowledge about the genome is another. Just as any business has business rules unique to that business, there's a lot that is unique in the genome and the structure of the genome. And there is the challenge that we know only so much about all of this. All of this is relatively new. It's been going on for the last maybe five, ten years. So there's so much more information to come. And here we have a real patient, and we have to take decisions in the face of partial knowledge. So there's a lot of guesswork that needs to go in, educated guesswork, and that complicates things. What I want to do for the rest of the talk is take you through a number of examples and show you how all of this happens: how we start from a large number of variants, funnel down to a small number and pick the right one, what complications arise, what algorithmic challenges arise in the process, and what human issues come about as we go through all of this. A few pieces of information first. Often it makes sense to narrow the focus from the four or five million variants down to a few interesting regions of the genome. The genome is huge, billions of characters. There are a few interesting regions in the genome that are more important than the rest.
And colloquially we call these things genes. We use the word in common parlance all the time. A gene is simply a stretch of characters in the genome which encodes the recipe for the creation of a particular molecule. Using these recipes, our cells manufacture various molecules, and these molecules then react with each other and carry out the daily processes of life. Those recipes are embedded in certain parts of the genome. You can think of these as the data descriptions, the variable declarations, in the genome. The rest of the genome is some sort of control flow. In programming, control flow is often what matters more to us; who cares about data declarations? But here we've only got as far as understanding the data declarations well enough. The control flow is well beyond us at this point. So we're going to look at the data declarations and say that the sections of the genome, the so-called genes, which have these recipe declarations in them are the things we understand better, so let's look at them first. There are about 20,000 genes in the genome. And to confound matters, these genes are not one contiguous stretch in the genome. A particular gene, any one of the 20,000 genes, is made up of many discontiguous stretches in the genome. So it's an interrupted recipe: one recipe, but written out in staccato with breaks in between. You need to know where to jump over these breaks so that you can aggregate the right characters and bring them together to get the recipe of that gene. Once you skip over the right portions and bring the gene recipe together, there's a certain code in which the recipe is written: a three-character code. You have to group the characters three at a time into triplets. Each triplet then codes for a particular molecule.
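The two steps just described, jumping over the breaks and then grouping characters three at a time, can be sketched as follows. The toy sequence, its coordinates and the helper names `splice` and `triplets` are invented for illustration; the lower-case letters mark the break only for readability.

```python
def splice(genome: str, stretches: list[tuple[int, int]]) -> str:
    """Join the discontiguous stretches of one gene into a single recipe.

    Each stretch is a (start, end) pair, 0-based, with `end` exclusive.
    """
    return "".join(genome[start:end] for start, end in stretches)


def triplets(recipe: str) -> list[str]:
    """Group the recipe three characters at a time, dropping any incomplete tail."""
    usable = len(recipe) - len(recipe) % 3
    return [recipe[i:i + 3] for i in range(0, usable, 3)]


# Toy "genome": two coding stretches separated by a break shown in lower case.
genome = "ATGGC" + "ggtttag" + "CAAATAA"
gene_stretches = [(0, 5), (12, 19)]  # hypothetical coordinates of the two stretches

recipe = splice(genome, gene_stretches)
print(recipe)
print(triplets(recipe))
```

Note that the break here falls in the middle of a triplet (`GC` + `C`), which is why the stretches have to be joined before the characters are grouped.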
And then if you take your entire sequence of triplets, convert each triplet to its particular molecule and assemble those molecules, that's what the net molecule generated from the recipe looks like. So this triplet code is something to keep in mind. Clearly, there's only so much you'll get from this picture, but all you need to take from it is that there are triplets of characters, and each triplet corresponds to a particular molecule, which is shown in bold italics here. With that background, let's look at a few interesting cases. Here is a case where a family walked in with blindness setting in in their 30s and 40s. Many members of the family were going partly blind in the central vision; they were losing central vision in their 30s and 40s. When we sequenced the genome, we found a huge truckload of variants, and then what did we do? We said, let's take those which lie in these gene recipes; that cuts the number down substantially. Within that, let's look at those which cause a change in the recipe. So this is the recipe code: is there a change in the code? For instance, if you change from CTT to CTC because this T changes to a C, the code doesn't change. It's still the same molecule that is coded for. But if you go from TAA to TAT, what is coded for changes to a Y, and that may be relevant. That's one of the rules for funneling, and it gets you down to a smaller number. In this family, for instance, what we found was that there's a C here that becomes a T. A character C that's normally present in everybody becomes a T here. And what does that do to the triplets? As I said, you create triplets of characters, and each triplet stands for a particular molecule. The triplet for R then changes to a triplet for X, because the C has become a T. Now X is something special. All the other characters you see in bold italics are molecule names. X is an indicator for end of program.
It just means that your program has ended. The recipe ends at this point. So a C to T character change, where the patient had a T and everybody else has a C, essentially took this recipe, imagine the recipe is a program, and brought it to an unceremonious premature end. Right in the middle, the recipe got truncated, because an end-of-program character appeared there. So the net result was that whatever molecule this recipe was coding for, it's now coding for only a pale sub-portion of that molecule. Now very often, since we have two copies of the genome, when one copy gets compromised in this way, the other copy still holds the fort: if the gene in the other copy is fully okay, then it will still do whatever it needs to do. However, in this family, in the other copy there was a G that became an A, and that caused the molecule to change from a G to an E. These sorts of changes are much harder to judge. The first one is a very stark change: there's an end of program that's come in, and it has decimated the molecule. Here, one character has become another, so one molecule has been replaced by another. Is this important or is it not? This is a very difficult question that we deal with on a day-to-day basis as we see patients, and there you have to rely on knowledge, on the literature, on scientific experiments that have been performed. In this particular case the experiment had been performed: someone actually made this replacement in a biological system and found that the gene was affected by it.
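The funneling rule described above, "does the character change alter what the triplet codes for?", can be sketched with the standard genetic code table. The single-letter molecule names follow the talk, with X standing for end of program. The function name `classify` and its category labels are assumptions for illustration. The talk gives only the letters R to X and G to E; the specific triplets `CGA` and `GGA` used below are assumptions that happen to reproduce those changes.

```python
# Standard genetic code, with the stop signal written as "X" as in the talk.
BASES = "TCAG"
MOLECULES = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: MOLECULES[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify(ref_triplet: str, alt_triplet: str) -> str:
    """Label the effect of swapping one triplet for another in a recipe."""
    before, after = CODE[ref_triplet], CODE[alt_triplet]
    if before == after:
        return f"synonymous ({before} stays {after})"
    if after == "X":
        return f"nonsense ({before} becomes end of program)"
    if before == "X":
        return f"stop lost (end of program becomes {after})"
    return f"missense ({before} becomes {after})"


print(classify("CTT", "CTC"))  # the no-change example from the talk
print(classify("TAA", "TAT"))  # the example that changes to a Y
print(classify("CGA", "TGA"))  # assumed triplet: C to T turns R into X
print(classify("GGA", "GAA"))  # assumed triplet: G to A turns G into E
```

Synonymous changes can be dropped from the candidate list; nonsense changes are the stark ones; missense changes are the hard calls that need literature or experimental support.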
So the net result: one copy of the genome was hit by a premature truncation, or a nonsense variant as it's called, and the other copy was hit by a replacement, one molecule replaced by another, which is called a missense variant. In this family both copies of the gene had effectively been knocked out, and so this gene, the ABCA4 gene, was effectively gone in the members of this family. Now what does that do for you? It's actually a very illustrative janitor function. What this gene does is this: these are the cells which allow us to see; they are in the retina of the eye, and when light falls on these cells there's a certain reaction that runs, and that produces a certain by-product out here. That by-product is trapped inside these so-called bags or discs; these are nature's garbage bags in some sense. This gene helps take this molecule and pump it out of the garbage bag, because inside the garbage bag there's no way to treat the sewage; once it comes out here, there's treatment available. When the gene goes for a toss, this garbage is not pumped out of the bag, so it stays there and becomes toxic, and over 30 years of toxic exposure it leads to blindness. So this gene is a simple janitor that ensures the garbage is cleared every day, and if the gene doesn't work for 30 years then garbage accumulates and blindness sets in. So that was our first example. Now I'm going to lead you through successively more complex examples, and you'll see why data and algorithmics become even more important. This was a case of a family where, very young, in their 20s, the heart started to give up: heart failure, meaning the heart can no longer pump blood; it's too weak to pump blood. When we went through all of the variants, again sub-segmented them to the genes, and looked at what causes a major change to one of the key genes that plays a role in the heart, we found a picture like this.
So what does this picture tell you? These are the reads, and three of the reads are okay; they match the reference sequence, and there's nothing wrong with these three. But in these other three reads, the two characters A and G that are present in most of us are completely missing. So it is not only that you and I differ in characters where one character is replaced by another; it could be that some characters are missing in me that are present in you. This means that in one copy of the genome two characters were missing in the patient at that place. Were those two missing characters the cause of the problem in this patient? That was a question we asked. But even before that, the question is: when you do your searching, when you take the billion strings and search for them in the reference sequence, how would your search strategy allow for such missing characters? The search for reads in the reference sequence needs to allow for insertions and deletions, and as you all know that's a lot more complicated than just allowing for substitutions. It's much more expensive, for one. You need what are called edit distance calculations, and the time for that is proportional to the number of insertions and deletions allowed, which is an important thing. So usually you limit yourself to, let's say, two or three or five or ten insertions and deletions, and even then it's quite expensive. So we use GPUs here, and we've done extensive development on this: if you compare the same sort of dynamic programming edit distance calculation on a CPU versus a GPU, there is a huge gain, 50 times or so, which you can use to make sure all of this gets done in good time.
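The remark that the cost grows with the number of insertions and deletions allowed corresponds to restricting the dynamic programming table to a band around the diagonal. Here is a minimal CPU-only sketch of that idea in plain Python, not the GPU implementation mentioned in the talk; the function name `banded_edit_distance` and the sample strings are made up for illustration.

```python
def banded_edit_distance(read: str, ref: str, max_indels: int):
    """Edit distance computed only inside a diagonal band of half-width `max_indels`.

    Work is roughly len(read) * (2 * max_indels + 1) cells, so it grows
    linearly with the number of insertions/deletions allowed. The result is
    exact whenever the true distance is at most `max_indels`; if the lengths
    differ by more than the band allows, None is returned.
    """
    if abs(len(read) - len(ref)) > max_indels:
        return None
    inf = float("inf")
    # Row 0: turning an empty prefix of `read` into ref[:j] costs j insertions.
    prev = {j: j for j in range(min(len(ref), max_indels) + 1)}
    for i in range(1, len(read) + 1):
        cur = {}
        lo = max(0, i - max_indels)
        hi = min(len(ref), i + max_indels)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # delete all i characters of the read prefix
                continue
            best = prev.get(j - 1, inf) + (read[i - 1] != ref[j - 1])  # match or substitution
            best = min(best, prev.get(j, inf) + 1)     # extra character in the read
            best = min(best, cur.get(j - 1, inf) + 1)  # character missing from the read
            cur[j] = best
        prev = cur
    return prev.get(len(ref))


reference_window = "ACGTAGTGCA"
read_missing_ag = "ACGTTGCA"  # the same stretch with "AG" missing
print(banded_edit_distance(read_missing_ag, reference_window, max_indels=3))
```

Each cell in a row depends only on the previous row and its left neighbour, and cells on the same anti-diagonal are independent of each other, which is the kind of regular structure that makes this computation a natural fit for GPUs.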
Now, once we have the algorithmics to actually do these searches allowing for insertions and deletions, the question arises: those two characters that were knocked out, what impact do they have on the genome? If you go back to the triplet coding in the recipes that I talked about, you have to create groups of three characters. Now if two characters are knocked off, what happens to the groups of three characters? This is how it looks in a normal person, and this is how it looks in the family which suffered from heart failure. These two characters A and G are gone, which means the next triplet starts here at T, so the next triplet is TGT and the one after is GCA. Instead of the triplet AGT you have TGT, which codes for a very different molecule, so S is replaced by C, and then B is replaced by A, and so on. You get the idea: everything downstream changes completely because two characters have come out. If three characters had come out, it wouldn't have been such a problem; you would have one molecule missing there, but the rest would still have been in register, because you take three characters at a time. But here, with two characters gone, the entire thing gets jittered. That's called a frame shift, and it makes the whole molecule completely different. Now what impact does this have? Here is where this gene plays a role. You have these heart muscle cells, and they are reacting to an electrical signal generated within the heart; they are throbbing, contracting and expanding. All of them have to contract and expand together, so that the whole heart contracts and expands as a unit, as opposed to individual cells contracting and expanding. So all the heart muscle cells are connected together by these strong rivets called desmosomes. You see these rivets; it's a slightly cartoonish picture, this is not how it actually looks, but it's somewhat like this. Now the gene that we've talked about was in one of these rivets, so it plays a role in these
rivets. If these rivets are not strong, the heart will still continue to beat, but over 30 years of beating without strong rivets the cells will start to come apart, and so wear and tear will build up in the heart much faster than it would in a normal person. By the time you're 25, the heart muscle cells are all really loose; there's no riveting. So that's what was happening in this person, and you see why we needed another, slightly more complicated algorithm to find this, and then what impact it has on the gene. Now there's another complication that comes up: these two characters were missing from only one copy of the genome; the other copy was fine. So the question that arises is, can the other copy not hold the fort while this copy is out? Can the other copy not do the task? It's a complicated problem, and there are many complexities here which we don't understand. Sometimes, for some genes, when one copy goes bad the other copy manages to step in perfectly and carry the job forward; nothing changes. Sometimes, for some other genes, one copy goes bad and things go really bad. Sometimes one mutation in a particular gene is mild enough that the other copy can step in, but another mutation in the same gene is so strong that the other copy cannot help. So it's quite a complicated picture, and we're still trying to figure out which mutations can be tolerated in one copy and which cannot. And again, the only insights that we really have are from data. So I'll give you an example of how insights from data from normal people help us solve cases in diseased people. Just to set the stage for this example: in the heart muscle, the core unit of the muscle is what is called a sarcomere. This is the unit that compresses and expands, and since it compresses and expands it has a spring. This long thing that you see here is the spring; it's made from a gene called the titin gene. This spring compresses and expands, and that's how
the heart compresses and expands. So suppose you have mutations in this gene, like these two characters going away, so that the whole gene gets compromised, but only in one copy of the gene; the other copy is fine. What happens then? Is that a problem? We find that in several cases it is a problem and in several cases it is not. So what you do is look at large databases of normal individuals whose genomes have been sequenced. We now have tens of thousands to a hundred thousand or so individuals who've been sequenced and whose data have been archived. When you look at mutations in one copy of the gene in these individuals and ask what the distribution of mutations is across the length of the gene, you'll find certain regions where there are huge spikes and certain regions where there aren't. Clearly, if in normal individuals there's a huge spike here, if lots of mutations exist here, that seems to suggest that mutations in those regions are probably not causative of disease, because lots of normal individuals have them. In certain other regions it's much sparser, and very few normal individuals have mutations there. So when an individual with disease walks in and we find a mutation, we can guess that if it's in the region with the spike it's probably a false alarm and we probably have to look somewhere else; otherwise it's probably a candidate. So when you're up against a lot of unknowns, aggregated data from lots of normal individuals helps you make decisions for individual patient cases. Now this whole thing is carried to an extreme in this example, where there were two children who died very early, in the first year of life, and they were born with several other complications as well. This is the length of the gene; it's called the FLNA gene, but that may or may not mean anything to you. This was the mutation, out here in this part of the gene, and the problem that we ran into was that this mutation has never ever been seen before; in no
human who has been sequenced has this mutation been seen before. If you take all the mutations in this gene that are known to cause disease, along with the various diseases they are known to cause, and plot them along the length of the gene wherever they occur, you see a lot of mutations out here, some out here, but practically nothing in this region, and this is where the mutation that we found arose.
Now the problem with such cases is the unknown. You have a mutation, and, well, I would say you have some idea, but you don't have enough confidence that this is indeed the cause of the disease. And yet there is an important decision waiting to happen. What is that important decision? Two children, both brothers, died early in this family. The family wants to have a third child; the woman is actually pregnant with the third child. They want to know whether this child will also be born with the same problem or not. If you wait for a year and the child dies, that's traumatic. If you can convince yourself that this mutation is the cause of the disease, then you can test for it and abort if needed, well ahead of time, sparing all of the trauma and agony. On the other hand, if the mutation is not there in the fetus, then the parents can rest assured that the child is going to be fine and will be normal. So there are important decisions riding on this, and it comes down to: is this mutation the cause of the disease or not? As I said, in any person you find lots and lots of mutations. When you look at all of the knowledge available in the literature, you'll usually find a handful that appear to be good candidates, and if you do a lot more informed guessing you'll still be left with a small number of candidates. And now you have to take a really important decision based on this, but science is not yet at a point
where it has given you a proof that this is indeed so. So you have to use various forms of guesswork, and I'm going to show you a couple of them. One: of course this is ripe for machine learning, as you can see. Here is a completely new mutation; tell me whether it is a problem. Take all the known mutations that cause problems and all the known mutations that don't, use those as training sets, build a classifier, and make a prediction. So the field is ripe for machine learning. Unfortunately, the accuracy you get with machine learning is about 70 to 80 percent, as you can see from this picture; these are various algorithms along the rows. That is just not enough in a healthcare setting where important decisions have to be taken. If you are in a different setting, where you have to decide whether to buy or recommend this shirt or that, it's perfectly fine: statistically you'll improve your buying outcomes and you'll improve revenue to the company. In this case human lives are at stake, so I think there's a lot of work to be done in making these predictors much better.
In the absence of that, one looks at various surrogate measures. As I said, you look at every human who has been sequenced and ask whether this mutation has been seen there or not, and you reach a point where you've determined that this mutation has never been seen in any human so far, of the tens of thousands or even the hundred thousand who have been sequenced. Well, can you look beyond that? If you look at the genomes of many, many other organisms, that opens a window onto hundreds of millions of years of evolutionary history. You can look back and ask: has this mutation ever been seen anywhere in the last hundred million years of evolutionary history? If so, it would be reflected in some organism or the other. You take all the organisms whose genome sequences are known and align them. So
this is a particular stacking procedure where you take different sequences and bring them in register so that the same character is vertically aligned. Then you look at the position where your mutation is and say, look at that position: over five hundred million years of evolution nothing has changed. Mice have the same character there, zebrafish have the same character, cats, dogs, cobras, everybody has the same character, and in these two little children it had somehow changed. I don't know how that happened. Rare events happen; when you're dealing with three billion characters, nature is constantly experimenting, so for some character or the other, this will be the first time you see it change. But this builds up some amount of confidence that this is a change that either evolution has not got around to trying or, if it tried it, there was resistance, so those changes vanished; they didn't survive. So this is an important character that resists change. You can do various similar sorts of guesswork, go on like this for a while, and build up enough confidence to say: look, science can only take you so far, but using all of the data and all of the knowledge that we have, this is our educated guess. Medical communities have to take such educated guesses all the time, because decisions cannot wait.
So let's go further to another case. I've given you examples of how different sorts of algorithms are needed; more algorithmic complications come next. But before that, a few examples of genomic structure, and why knowing the structure of the genome and bringing those rules into all of this decision making is important. For instance, in these gene recipes we talked about end-of-recipe indicators, the end-of-program indicators, and mutations sometimes bring about the end-of-program indicator prematurely, truncating the recipe. Similarly, there is a start-of-program indicator, and it's typically an ATG. Nature has not been clean in this, in that it has
overloaded the start of program with a regular molecule. ATGs can appear anywhere inside the recipe as well, in which case they stand for a molecule, but at the beginning of the recipe they stand for a start-of-program indicator. So there's some unfortunate overloading that has happened, and the net result is that occasionally there are changes. Here is a child in whom the A in the start indicator had become a G: ATG is the typical start indicator, but the A had become a G. If this was a serious problem, if without a start this gene's recipe was basically rendered useless, then the child should have had symptoms, which the child did not have. That was a mystery: how could it be? The start is no longer there, so the recipe is probably gone. So then one has to use genomic structural knowledge and ask what could have happened. What could have happened is that when this start goes away, there could be other ATGs in the middle, as I said, and one of those ATGs could serve as a new start, and things could have picked up from there. But then which of those ATGs could have served as the new start? What you do is take all of the genes, align them, and look at the common characters that are present around their starts, and you see that there's a certain motif to a start. The ATG is the start, but it should have a G next to it, and a C to the left of it, and another C to the left of that, and so on. So there's a certain structure required for a particular ATG to serve as a start. What you have to do, then, is bring in algorithms that take all the ATGs that follow and ask which of them have the surrounding sequence of characters that makes them a candidate start. So here are the next six ATGs in that gene, color coded: dark colors mean a good match to that motif, light colors mean not as good a match. You can see that probably here is where the gene would have picked up a new start, picked it up again, and so on. And then you make all your
predictions based on that being the new start, saying this is the new gene that has resulted, and asking what that gene would do.
Now let's come back to an algorithmic challenge again. Here is a conundrum that I'll pose, and then I'll tell you what the algorithmic issue is. Here was a family where two children had heterotaxy. Heterotaxy means that the organs are out of place. This is actually something that inspires some deep thought as well: our stomach is always on our left and our liver is always on our right, for every one of us. I don't know how the symmetry breaking happens. Nature knows left versus right very clearly. Left and right look more or less symmetric when we look from the outside, but on the inside the stomach is always on the left and the liver is always on the right. No matter whether the mother is lying down or standing, the baby's stomach, when it gets created, is always put on the left, which is very intriguing. But moving beyond that: in this family that had gone haywire. The stomach went on the right, the liver went on the left. But this mirror flipping did not happen for all the organs; some organs were flipped and some were not, and that caused a lot of anatomical problems.
When we sequenced the genome, this is the picture that we saw. Often you sequence the children and the parents, so that you can funnel that large number of variants down to a much smaller number using the genomes of all three people as opposed to just the genome of the patient at hand. So here is what we saw. In the children, all of the reads were missing a T. In the father, one copy of the genome was missing a T; the other copy seemed perfectly fine. So that would make sense: in the children both copies were missing a T, it looked like, so they had probably inherited this copy from their father, the mother was also probably missing a T in one of her genome copies, and the children had probably inherited that copy from their mother. And so together they had
inherited two copies, both missing a T, one from their father and one from their mother, and since both copies were missing a T, this one missing character in both their genome copies was creating the problem. However, when we looked at the mother, the mother's genome was perfectly fine; nothing wrong with it at all, nothing unusual about it. So the mystery now was: how did the children have the missing T in both their copies? From the father they would have got it in one copy, but from the mother they should have got a good copy, and that should have shown up here. When you have 100 gigabytes of data and you have this problem at hand, you have to go back to those 100 gigabytes and ask, did I miss anything? That's where the challenge arises. First you have to scratch your head to formulate a hypothesis about what to look for in those 100 gigabytes that might explain the problem, and then you have to go and test that hypothesis and ask whether there is some indication in the data that something like this is happening.
This is a question you can think about if you want, but I'll tell you the hypothesis that we formulated. The hypothesis was that in the father there is one good copy of the genome and one copy with the T missing; fine. In the mother, one copy of the genome is perfectly fine, and in the other copy there is a huge chunk missing, a very, very large chunk, and the children inherited this copy with the huge missing chunk from their mother. If they did, what would happen when you sequence the children? If this huge chunk is missing completely, no scraps of paper will come from there, no reads will come from there; all the reads will come from this copy. Since all the reads come from this copy, it will appear as if they all have the T missing, while in reality the T is missing from only one of the genomic copies. It's just that, just because there's a
missing T in all of your scraps of paper doesn't mean that both copies of your genome have a missing T; it just means that you've read scraps from only one copy. So there's some lateral thinking needed, and this is the hypothesis that gets formulated.
But now, to verify this hypothesis, you need a different algorithm. Why? Because now there's a huge insertion or deletion here, and you need to allow a large number of indels, and your edit distance algorithms are typically very sensitive to the number of indels you allow. If you allow something like 10,000 indels, no algorithm will run on a billion reads and finish in any reasonable amount of time. So the standard algorithms become very expensive. What you have to do instead is take a scrap of paper, a read, split it into two pieces, and run the edit distance on each of the two pieces individually, so that each one allows for maybe one or two indels, and both of them are searched for independently. But then you need to know how to split them, and there are algorithmic things you can do to guess where to split in the right way. Once you split, this piece will go and find its match here, and that piece will go and find its match there, very far away. So each read will have one half that finds its match here and another half that finds its match far away. When you have many, many such reads that are split in two and match very far apart in the genome, it suggests that the part in between has been deleted, and that's what was happening in these children, as we verified. So this required developing a whole new algorithm, where you look at the reads that don't easily match in the genome: when you query them they don't get a hit easily, so you ask where it is that they stop getting a hit. Somewhere at that point you have to split, so there you split it, and now you have two independent pieces and you look for each independently, and when you find them you know that
there's a large missing piece.
So we have maybe five more minutes, and I will go through a couple more interesting challenges. Here is an interesting structural challenge. We talked about the fact that gene recipes are not contiguous in the genome, and that's a source of great irritation. The recipe is written here, then you skip this, and it's written here, and you skip this, and it's written here. More often than not, as I said, we look for variants in patients that are present in these recipe regions. But occasionally you'll find a variant that is deep inside a region in between, in the so-called intronic region, where the recipe doesn't play a role. There's a variant there, and it is the only variant that seems to be present in this patient. So you ask yourself: there's nothing else, and the only variant you can find is not directly in the recipe, so can it really play a role? Could it be the cause of the problem? There again, understanding various structural facets of the genome makes a big difference. For instance, the one question that you should all be asking by now is: if the recipe is broken up like this, how do our cells know that they have to read this piece, then jump across all of this (and this is not a small stretch, it could be tens of thousands of characters), then read the recipe from here to here, and then jump across again? How do they know to jump from here, across an ocean, and land at the right place 10,000 characters away? Clearly there must be a lot of structure that has to be used, and there is indeed a structure. I'll hand-wave through this very quickly. Actually the jump happens in two steps: you do a long hop to what is called a branch point, and from there you take a small step to the next character. It so happens that at these breaks, where the recipe stops, there should be a GT, and then where you resume there should
be an AG. Of course there are AGs, plenty of them, in between as well, these ones highlighted in yellow. But where you jump from there should be a GT, and where you end up there should be an AG; there are several other AGs which you have to ignore, and you have to go to this particular AG, and that's the challenge. That challenge is accomplished in two steps. You jump to an intermediate character that must always be an A, and from that A you jump to an AG, and in between that A and the AG there must be a stretch of largely Cs and Ts; this is a close-up of that. Now if one of these characters changes, like here, where a T became a G, what happens is that a new AG has come up all of a sudden in the middle. So instead of jumping to the intended AG you end up jumping to this AG in between, and the recipe starts from here even though it should start from there. This is a character change that has happened not inside a recipe; the complicated jumping that happens from portion to portion gets completely misled by character changes here and there. And a simple case such as this causes the serious problem of thalassemia, which I mentioned earlier, where your body cannot carry enough oxygen. Similarly... I'll skip that and get to the last point.
Now there are even more complicated changes that happen in the genome that require much more careful and much more specific algorithmics. I'll illustrate this with one example, and you'll see the beauty of all the things that are going on in the genome. This is red-green colour blindness. How many of you can see the number on this slide? I don't know; these projectors usually may not work very well. Can all of you see the number here? Yeah, you probably cannot, not because your eyes are faulty but because the projector, the
colours on the projector are not quite what they are on the screen. There's supposed to be a subtle hidden number there, a subtle play of colours. People who can't distinguish between red and green as well as most people can will not be able to see that number, while people who can will easily be able to see it. Given the subtlety and given the projector transformations you probably won't be able to see it, but there's a number hidden in there, and almost everybody can see that number if I show it to you on a piece of paper. But about 5% of people, including myself, cannot see this number. Now the question was, how do you figure out what caused this problem in the genome? As a problem it's not a serious one, but nevertheless, what the genomic event is is a good question to ask. And the genomic event is a complicated one. The way we see colours is that we have sensors for different colours. There's a red gene for the red sensor and there's a green gene: the red gene creates a red sensor and the green gene creates a green sensor. Unfortunately the red and green genes are very similar to each other, so similar that there are just a few places that differ. Those few places are good enough to differentiate between red and green, but if I have a variant or so which makes the number of differences smaller, then my ability to differentiate between red and green becomes lower and I can't see these numbers as easily.
We probably don't have enough time to get into this, but I want to show you the event that happens. Because the red and green genes are so similar to each other, what happens is this. When there is transmission from the mother to the child, the transmission of the genome doesn't happen just like that. The mother has two copies of the chromosome, and she creates a mosaic from them: she takes a part of one copy and a part of the other, stitches them together, and gives that to a
child. Now this mosaic is usually created like this: you cut like this, you take this and this, and you stitch them together. So usually nothing goes wrong. But occasionally the mosaic gets created like this: you cut here, and then you cut here in the green. Since the red and green genes are very similar to each other, the cutting machinery doesn't realize that it is cutting here and here; it behaves as if it were cutting at the same place in both copies. So you take this portion from the red and this portion from the second chromosome and stitch them together, the black portions together or the orange portions together, and that's what you pass on to the child. As a result you start getting these hybrid genes: this red combined with these three greens, so you get red-green hybrids. You'll see lots of different hybrids here. This is the normal picture, here's a hybrid red-green gene, here's another hybrid situation, depending on where these cut-and-paste events happen. In people who have these sorts of hybrid genes, where you have one red gene but the other is a mixture of green and red, the ability to distinguish between red and green goes down. And you now have to use algorithmics to determine which of these various configurations occurs in a particular individual. So you do various sorts of counts. This I did for my own genome, and I found that this is my situation. This is what most of you have, and this is my situation, where I have a red gene, then a green-red combination, and then some extraneous green genes hanging around. As you can see, figuring this out requires some careful counting and algorithms, which I won't get into.
So overall, just to wrap up: I've shown you a number of different cases, and all the complications, all the algorithms that are needed, and all the human issues that come up. And when we do hundreds of
cases a month, the picture looks like a genomic war room. There are samples coming in on many days, going through a lab, heavy data crunching going on, many different business rules and algorithms being applied, a lot of literature mining going on. All of this is done under regulatory compliance, because this is health and you cannot just do whatever you want. And then there are anxious patients and doctors calling and asking, did you find the result, was there anything interesting that came up? So the whole thing looks like a genomic war room, so to say, where all of these complex activities are happening and one is trying to scale all of it.
To wrap up: for those of you who are interested, there's a book with all these experiences that is in the last stages of being put together; it's available at this URL. We also run various cancer risk awareness and inherited risk awareness programs. If any of you are interested for your company, just contact our HR, and as part of this program we'll be happy to connect with you. With that we'll come to an end and open up for any questions. I think we have two or three minutes.
Questions
Thank you very much, it was a good talk. Can you please explain a little about the technical part of it as well? We did hear a little about GPUs and hypotheses and statistical models, but from the raw data to transforming it and then the comparisons, maybe a little bit on the technology part, if possible. And second, there seems to be a lot of domain knowledge involved here. Did you start off from the domain and then learn the technical part? I'm just curious how you balanced the domain and technology parts in the early days.
Yeah, okay. So to answer your question on the technology stack: everything is built ground up, because these are all specialized things and there's nothing you can lift off the shelf and put in. So you can imagine what you would do if you were
to build all of this ground up. As far as the domain part is concerned, yes, you need a multidisciplinary team doing this. You clearly need all the computer science, the algorithms and the systems, to go along with not only the biology but also the medical translation domain, because eventually all of this is going to patients and doctors. So it's a complex endeavor; it requires bringing people with many different skills together.
Hi, this is Vinay. I was just wondering, what would be the hypothesis to test if an alien genome were given to you?
A hypothesis test if there's an alien genome? Suppose there is a non-earthling out there. I mean, you just compare; this is how all comparative genomics is done, right? You find a new organism, you sequence its genome, you compare it with yours and see how different it is, and depending on how different it is you get a good view backwards into evolution and can say this is what might have happened in history hundreds of millions of years ago. So the assumption is that there's an alien genome, and let's assume that alien life is based on the same principles that life on earth is; all life on earth seems to be based on the same principles. So it would be a test of common origin: you would look at the differences, and that would tell you that maybe billions of years ago, somewhere on a common planet, all of us were together, and then we branched off and they went somewhere else. So let's...
Hi, this is Sharath here. Thank you for the presentation. You said scientific literature and genome structure are the key determinants. How much importance does the scientific literature have, other than the reference genome?
Scientific literature is key because, as I said, all of our knowledge is captured there. For any one individual patient you find tens of thousands of variants, and even after using all the rules that you can, you still
have several tens left. Unless you can bring in everything that's known in the scientific literature and say this is why you can eliminate all of these, you cannot narrow them down, so it's super important. We do put a lot of effort into mining that literature and bringing it into place. Some of that ends up being manual, because the accuracies needed are substantial, and some of it is automated, but it's hugely important.
The balcony has been opened upstairs, so people who are not finding seats can please move upstairs.
Hi. Are you finding correlations between, say, ethnicities and diseases for the Indian population yet, or do you intend to do that?
There are correlations, but a lot of what we focus on is characters which are somewhat overwhelming in their impact, and those tend to be relatively universal across all races on earth. However, for things which are more subtle, like diabetes and so on, there is more prevalence in one population or the other.
I am not very familiar with genetics, but there are things like Tay-Sachs, which is much more common in Jewish populations, right? Are you finding things like this in India?
Yes. It's interesting that in India you do find what is commonly ascribed to the Jewish populations, because they were heavily inbreeding at some point, presumably. A lot of that you find in India as well, and several pockets in India have much higher inbreeding coefficients. So you do find a lot; India is just less talked about, but increasingly what we are finding is that a lot of those variants are present in the Indian population at equal or higher frequency.
Thank you.
Yeah.
Hi. I have one question regarding the investigation. As you mentioned, it requires a lot of customized thinking and the process is not straightforward. So what is the typical time it takes to solve an investigation, and how much customization is actually required?
Right, so this is a very good question. This is what the whole process of industrialization is about, and
what we've been working on over the last couple of years. A couple of years ago it would take us weeks to solve a case. Today we are down to a manual effort of maybe a day or so, and in some cases, you know, smoking guns, you can see it in just two seconds, but on average let's say a day or so is what it's down to. Can we get it down to full automation a few years from now? That is a big question, because it translates to cost quite a bit: a day of specialists, and these are all PhDs, staring at data is quite expensive. So it's a great question. There's a lot that needs to be done, though, but eventually it will happen. In five years or so I think all of these things will be much more automated than they are today.
Thank you.
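The split-read idea described in the talk (a read that gets no full-length hit is cut in two, each piece is searched for independently, and many reads whose pieces land far apart point to a deletion in between) can be illustrated with a small Python sketch. This is only a toy under stated assumptions, not the speaker's algorithm: exact substring search with `str.find` stands in for the edit-distance search that tolerates one or two indels, the split point is found by brute force rather than guessed cleverly, and the names `split_read` and `call_deletions` as well as the sequences are hypothetical.

```python
from collections import Counter

def split_read(read, ref, min_len=8):
    """Return (del_start, del_end) in reference coordinates if the read is
    explained by a deletion, or None if it matches contiguously or cannot
    be placed. Assumption: exact matching replaces edit-distance search."""
    if ref.find(read) != -1:
        return None  # full-length hit: no evidence of a deletion
    # Try split points from the longest left piece down to the shortest.
    for k in range(len(read) - min_len, min_len - 1, -1):
        left, right = read[:k], read[k:]
        lpos = ref.find(left)  # toy: only the first occurrence is considered
        if lpos == -1:
            continue
        # The right piece must land downstream of the left piece.
        rpos = ref.find(right, lpos + k)
        if rpos != -1:
            return (lpos + k, rpos)
    return None

def call_deletions(reads, ref, min_support=2, min_len=8):
    """Report (start, end, n_reads) for intervals supported by enough split reads."""
    votes = Counter()
    for read in reads:
        hit = split_read(read, ref, min_len)
        if hit is not None:
            votes[hit] += 1
    return [(s, e, n) for (s, e), n in sorted(votes.items()) if n >= min_support]

# Hypothetical toy data: a reference, and a genome copy lacking a 30-character chunk.
left_flank = "GATTACAGGCTTAACCGTAG"
deleted = "TC" * 15
right_flank = "CAGGTTCATGCCATAGCAAT"
ref = left_flank + deleted + right_flank
carrier = left_flank + right_flank  # the copy with the chunk missing

# Three reads spanning the junction in the carrier copy, one ordinary read.
reads = [carrier[8:32], carrier[4:30], carrier[10:34], ref[5:29]]
print(call_deletions(reads, ref))  # [(20, 50, 3)]
```

Requiring support from several reads (`min_support`) mirrors the point made in the talk that it is the accumulation of many split reads matching far apart, not any single one, that suggests the part in between has been deleted.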