Hello, everyone. I'm Joseph Rickert, Director of the R Consortium, and I'm very pleased to have the opportunity to introduce Robert Gentleman, our next keynote speaker. Most of you know that along with Ross Ihaka, Robert created the R language. You may not know that after that, Robert embarked on a career as a computational scientist, holding leadership positions at several prominent institutions, including Harvard University, the Dana-Farber Cancer Institute, the Fred Hutchinson Cancer Research Center, Genentech, and 23andMe. Recently, Robert returned to Harvard, where he has been appointed Founding Executive Director of the Harvard Medical School Center for Computational Biomedicine, with a mission, as described in the Harvard press release, to, quote, conceptualize the scientific vision for computational biomedicine across HMS, end quote. It would take a Wikipedia page to do justice to Robert's accomplishments. Here, I would just like to mention that while pursuing his extraordinary career, Robert has continued to be a passionate open-source champion and community builder. Much of what we now take for granted in the R world, including the R Core group, the culture of cooperation, the Bioconductor project, and the R Consortium, may very well not have happened without Robert's vision, support, and personal touch. Please virtually welcome Robert into your homes and offices for a view into the future. Robert. Great. Thanks a lot, Joe, and thank you to everybody who's coming. As Joe said, I just very recently switched jobs and have taken a new position at Harvard in the medical school. It is my first real foray into that, and I didn't know when I accepted the invitation that I would actually be more on the medical side and less on the pharma side. It has unfortunately caught me in between jobs, and when you leave industry, it's really hard to take your work with you.
So what I have put together for you today is some of the ideas that I'm hoping to start to develop, some observations and things that I think are really important for how we are going to bring computational science into the practice of medicine. And that is largely what I'm hoping to do at Harvard. And maybe a way to conceptualize it, and how I think about it a little bit, is that with R, I was involved with a fantastic group of people and we built a tool that's just broadly useful for everybody. It doesn't matter what area of science or pretty much anything else you are working on; if you need to do computation, then R can be a vehicle for you to do that. And then when I was at Harvard the first time, in the School of Public Health, as Joe said, I worked again with a slightly different but somewhat overlapping group of people to set up the Bioconductor project. And there, what we tried to do was to really say: what happens if you move into a discipline and say, what we're going to do as a group is agree on data structures? It turns out that if you agree on data structures, then it's much easier to share code and algorithms than if you don't. But if we build that infrastructure, will people come? And by and large, that too has turned out to be a true statement. And then as I went on through the Hutch and Genentech and 23andMe, what I found was that for institutions or organizations to succeed as units, there is a need for a very large piece of data infrastructure that lives in the middle. And so today, I'm going to talk a lot about that and why I think it's really essential for the future of computation in medicine.
And I hope by the end of the talk, I've convinced you that there are some very interesting problems and big opportunities here that are, as I said, centered around how we put together infrastructure that supports institutions or organizations that have shared needs and goals. So what's the future going to look like, and where do we think computation will come in? I've seen a few talks here and elsewhere, and it's really unlikely that most doctors are going to use R or any other computer language directly in their practice. But what is highly likely is that doctors will rely more and more on algorithms to guide care and lifestyle decisions, sort of wellness opportunities for their patients. And in order to use those algorithms, we're going to need very large databases of well-curated data that will help to support those decisions. And I do think that the well-curated part is absolutely essential to this, and even more essential to the problem of developing new algorithms and new methods. So I'm not a believer that a big database that you just search through for interesting facts, with little curation and not much annotation, is going to be the resource that gets us very far in medicine. I could well be wrong about that, and there are certainly people with a different opinion. And in order to do these things, we're going to need a whole bunch of different inputs. Some of those inputs will come from things that we're starting to use already, like smartwatches, and some way of getting at food consumption and the variety of foods that people eat. These turn out to be really important facts for understanding wellness and general health, but they're poorly collected in general. And if we don't have the data, we can't use the data to help make those predictions.
And if the information that you need is how much you exercise or how much fiber you eat, and that's the biggest lever for understanding risk for a disease, then absent that data, we're not going to get there. Genetics, which I've spent the last five years learning a lot about, and there will be a fair amount of that in the talk today, and genomics are going to come in; I think they are reasonably straightforward, and we are close to doing them at the scale they need to be done. And then there are the other sorts of pieces of information, like medical history, current health status, your blood values, etc. But I do want to push a little bit in this talk: it's not going to be, is it EHRs or self-reported data? It's going to be both of those. You have to have inputs on activity and food and risk, etc., to be able to make algorithms that are highly predictive, because for certain diseases, it's lifestyle that beats everything else. If we're going to rely on algorithms in our medical practice, then there are some things that we absolutely have to do. And that is to have these large, well-curated data sets on which the algorithms have been trained and tested. And the data sets have to be comprehensive: they have to cover the range of patients, diseases, and exposures likely to be seen in practice. And I know there is a new sort of thing out there of, oh, machine learning has bias. But it's not really bias; it's the fact that you're trying to use a machine learning algorithm on an input that hasn't ever been seen before. That should be something that we catch earlier, and in some sense, it would be better if algorithms said, hey, this doesn't look like data that I've ever seen before, and I'm not going to give you an answer.
And sadly, many of them are implemented as: I'll give you the best answer I can, even though the data that came in is a long way from any of the data the model was ever trained on. And we'll see some real-world examples here where that has turned out quite badly for at least a range of people. The instruments that are being used by the clinician need to have near real-time access to these appropriate data resources. It doesn't have to be quite real-time, because the doctor doesn't have to fill in the risk information the minute you come into their office. If it's a regular checkup, they could have that pre-populated and have time to do it. But if this is going to get used in ERs, or any place where you don't have time to realize that you're actually going to require emergency services and pre-populate these things, we'll need to make sure that the resources are there, fast enough, and available enough. And then most of what I'll do here is give you some ideas of things that I think will happen in the near future. Many of them are already underway, and they're certainly not my ideas; other people have been expressing these, and I'm just going to try to lay out where they are, and then what has to happen if we're actually going to get these into the clinic. And that seems to be a place where, at least in most of my conversations, there are big gaps in what people think is possible and probable. And I'll try to give a bit of guidance on what I think is needed. In genetics, the most important formula is genotype plus environment equals phenotype. And mostly what we can do here is just think of phenotype as really wellness. How fit are you? Do you have a disease of some sort? Are you at risk for that disease, etc.? Those are our phenotypes. And so if we want to predict phenotypes from data, the data has to have genotypes and environmental data. That environmental data consists of things like exposures, behaviors, and other things.
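One way to make the "genotype plus environment equals phenotype" idea concrete is a toy predictive model that combines a genetic score with environmental exposures. This is only a sketch: every weight, exposure name, and input value below is hypothetical, invented purely for illustration, not anything from a real risk model.

```python
import math

def phenotype_risk(genetic_score, exposures, weights, intercept=-3.0):
    """Predicted probability of the phenotype from genotype plus environment.

    All weights and inputs are hypothetical; a real model would be fit on
    large, well-curated training data.
    """
    linear = intercept + genetic_score
    for name, value in exposures.items():
        linear += weights.get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-linear))  # logistic link

# A hypothetical person with elevated genetic risk, little exercise,
# and substantial smoking history.
risk = phenotype_risk(
    genetic_score=1.2,
    exposures={"exercise_hours_per_week": 1.0, "pack_years": 20.0},
    weights={"exercise_hours_per_week": -0.1, "pack_years": 0.05},
)
```

The point is structural: without the environmental inputs, the model falls back to genetics alone, which is exactly the gap the talk describes.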
And as I said a little bit earlier, I think genotyping is the easiest problem to solve; we can talk about that. Environmental data at scale is harder and doesn't reside in a single location. Or at least it doesn't now; if we can't have it in one location, we'll certainly have to have it in places that are easily accessible from a single doctor's office. One thing I found at 23andMe is that if you want to get phenotype data on people at scale, in the 10-million to 100-million range, it's not going to come from EHRs or EMRs, because there's no set of hospitals using exactly the same system that has on the order of 100 million patients; that's just not going to happen. But we can set up a survey, send it to 100 million people, and get relatively uniform data back. And for a lot of the environmental data, that's really going to be the only option; there aren't others that I'm aware of. Continuous measurements are great, but they're expensive, and you're going to store a lot of data. If you only want a small part of it, then it's better to get everything from a group of people, figure out what the important number is, and then, for the whole population, even though you could have a vast amount of data, resolve that down to the relatively small number of inputs that you'll need. And then improvements as we go forward in machine learning, and there are lots of people talking about these, are just going to help us build better and more interpretable models. All right, so just a short digression on exposure. One of the places where we really lack good exposure data is lifetime exposure. Those of you with your cameras on can see that I'm getting a little bit older, and hopefully a little bit wiser, but mostly older. And so diseases of aging start to make me want to study them.
And it turns out we have a reasonably good idea of how to establish lifetime exposure to smoking: pack-years has been studied for quite a long time, and at least in most people's hands, it does a reasonably good job. We don't have good estimates of lifetime exposure to alcohol, and we don't even know if that's important or not. We don't know whether it's how much you drank in the last few weeks that puts you at risk for a disease, or whether it's really this lifetime exposure. Exactly the same thing for exercise, fiber, and so on. So there are lots of things where we know it matters more how you behaved over your whole lifetime than how you behaved last week. And EMRs, EHRs, and even self-report are very challenged to get us that kind of information. So an area where I think there's a lot of room for improvement in data collection, and in just understanding how to do things, is this sort of lifetime exposure, which I think will ultimately be essential to get good models. And then, of course, in lots of these things, causation is almost impossible to establish. It's very hard to get the right sort of experiment, so we're going to be stuck with things that are more observational. A simple example: when you look at wellness, we know that elderly folks who socialize a lot tend to be healthier than those who don't, but we don't know whether it's the socializing that keeps them well, or the fact that they're well that allows them to socialize. And causation would be nice, because once you have causal relationships, it's easier to understand and interpret the models. And that's in some sense where genetics again wins out a little bit, for folks that aren't aware of it: genetics is essentially causal. You have your genotype at birth, so everything else has to come after that; your exposures can't really cause your genotype.
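The pack-years measure mentioned above is one of the few lifetime-exposure metrics with a standard definition: packs smoked per day (one pack being 20 cigarettes) multiplied by years of smoking. A minimal version:

```python
def pack_years(cigarettes_per_day, years_smoked):
    """Cumulative smoking exposure: packs per day (20 cigarettes) times years."""
    return (cigarettes_per_day / 20.0) * years_smoked

# Ten cigarettes a day for 30 years is 15 pack-years, the same cumulative
# exposure as a full pack a day for 15 years.
exposure = pack_years(10, 30)
```

The analogous quantities for alcohol, exercise, or fiber have no agreed-upon equivalent, which is exactly the gap being described.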
All right, so genetics, I believe, will be playing an increasingly important role in medical care. It's often prioritized by risk, so whether you get genotyped, and we'll see some examples. For example, getting genotyped for a BRCA mutation is an expensive operation. It's not routinely carried out, though it does tend to be routinely carried out in people at high risk. But that's basically a cost issue: can we make the cost-benefit work out? And then there's the observation that most of these alleles, BRCA mutations for instance, are really quite rare in the population. So if we sequenced everybody, we'd do an awful lot of sequencing at high price and get relatively few individuals to whom we could provide a benefit. But the thing that's not always being considered is that we can now genotype individuals at very, very low cost. It's in the range of $100 a person: you use a microarray, you get genotypes at a backbone of about three quarters of a million variants, and then you use a process called imputation, which is reasonably cheap again. And so, at a total cost of ownership of about $100, you get 40 million plus variants that have been reliably imputed. And if you had some variants that you specifically wanted to impute better, and you knew that they were in the population at some frequency, you can sequence a small number of individuals, and then you'll start to impute those variants really well too. So it's an amazing tool, and it works quite well. If we did that, then at birth you would know all of these pharmacogenomic variants, you'd know adverse drug event variants, you'd know variants for some of the rarer diseases.
And we could also, and I'll talk about these later, be given polygenic risk scores, which are reasonably easy to develop for thousands of diseases. You essentially could at birth be given this piece of information into your medical record that said: here are the drugs that you probably want to be really careful with, because you'll have an adverse event, and these are the diseases that you're most at genetic risk for. And at some point the data will catch up, because getting the gene-by-environment interactions is going to be the hardest part. We don't measure the environment as well as we want, and any interaction estimation requires larger data sets than main effects, as all the statisticians in the audience know. So this is going to come along a bit later. But we can today tell people an awful lot about what they're at risk for at birth: once they're born, genotype them, and then you know an awful lot about what the risks are going forward. And, as I said, in 20 years I fully anticipate we will be able to start telling people things like: this kind of exercise is good for you, this kind of diet is good for you, this kind of diet might be really bad for you, et cetera. So maybe just a little digression into the human genome. Again, most people probably know this. There are about three billion nucleotides. We can sequence it and attempt to measure it at every location; that cost is still in the thousand-dollar range. You can get it down a little bit, but for medical-grade genetics, it's probably close to that. And then there's the cost of owning that: these whole genome sequences are actually very large, and we don't store them as efficiently as we could. So you're then looking at another fifteen hundred dollars or so. Whereas for genotyping, it's much, much less than that. And so it really does turn out to be pretty cost-effective.
So we have twenty-two autosomes and then the sex chromosomes; as most people know, women have two copies of X, and men have one copy of X and one copy of Y. And variation in the sequence of the genome is actually associated with human disease. That's what we study in GWASs, and I'll show you some examples in a little bit. It is very hard in general to go from a variant in the genome that associates with a disease to knowing exactly what gene is involved, what happens to that gene, and how that causes the disease. That's something called the fine mapping problem. And that is what most pharma, certainly what I did at 23andMe and a little bit of what I did at Genentech, almost all pharma is obsessed with: trying to find solutions to this fine mapping problem, because that's the way that you find drug targets. Once you have a target, then you start to develop the therapeutic against that target. The reason that imputation, this idea that I only have to measure your genome at a fairly limited set of places and can then fill in the values in between, works is really because of crossing over. While you get half of your DNA from mom and half from dad, it's very rare that any chromosome in your body is actually the same as a chromosome in either of your parents, because in making the egg and the sperm, there is this recombination, this crossing over. And it's roughly two to three crossovers per chromosome, per meiosis. So each of your chromosomes looks like the ones that your mom has, but since she has two copies, it'll be part of one and part of the other, stuck together. But that means DNA travels in clumps: you have long sequences of DNA that go together, and the changes in those are not that rapid across the population.
So if I see some markers in one individual and I see exactly the same markers in another individual, it's very likely that the pieces in between are identical as well. And we can actually estimate that: we can do the predictions and not only fit the models reasonably well, but know when we're not fitting them well, which is exactly what you need to know: hey, this imputation didn't work very well, and we shouldn't be relying on it. There are complications, and we'll see them as they come up in a bit. But I just want to introduce this thing called linkage disequilibrium, so you know about it. It says there's a strong association between two nearby variants; essentially, if I know one of them, I know the other. In a statistical sense, this just causes confounding, and it makes it hard to identify the likely causal variant. All the cases for a particular disease, if it's genetic, may share the same fairly long piece of DNA that they've inherited from an ancestor, and they will be identical at every variant along it. So you can't say whether it's this variant or some other one; you have to test every one of them individually to try to understand which is the likely causal variant. And that's part of what makes it really hard to say: here's a location in the genome that associates with the disease, and I know this variant is the causal one. Because it doesn't have to be; it could be anything that is in linkage disequilibrium with it. Other challenges: we don't have a perfect reference yet, as I'm sure everybody knows. There's lots of variation that hasn't been accounted for. We really don't deal with structural variation at all well. We don't deal with trinucleotide repeat sequences particularly well, and many of those associate with diseases, as do transposable elements. Again, very challenging to do.
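The marker-matching idea above can be sketched in a toy form: because nearby variants travel together in LD blocks, an individual whose typed markers match a reference haplotype probably carries the reference alleles at the untyped sites in between. Real imputation methods use hidden Markov models over large reference panels; the tiny panel and positions below are invented purely for illustration.

```python
# Hypothetical reference panel: each string is one haplotype over five
# adjacent variant sites.
REFERENCE_HAPLOTYPES = [
    "AACGT",
    "ATCGA",
    "GACTT",
]

def impute(observed, typed_positions):
    """Fill untyped positions from the first reference haplotype that matches
    the observed alleles at all typed positions; None if nothing matches."""
    for hap in REFERENCE_HAPLOTYPES:
        if all(hap[i] == observed[i] for i in typed_positions):
            return hap
    return None

# We only typed positions 0 and 4; the flanking markers identify the block,
# and the haplotype fills in the three untyped sites.
full = impute("A???T", typed_positions=[0, 4])
```

A real method would also report a confidence for each imputed site, which is what lets you say "this imputation didn't work very well, don't rely on it."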
The other thing that we need in genetics is this thing called phasing. While you have two copies of each chromosome, when we sequence them or genotype them, those two get mushed together, and the algorithms are getting much better at being able to take that data and come out and say: one of your chromosomes has this sequence all the way along, and the other one has that sequence. And again, this is really important, because it may be essential that you have two variants on the same strand of a chromosome to get the defect. If one is on one chromosome and the other is on the other chromosome, you may not actually have the defect. So we need to know how to phase those variants to understand your risk. What does this get used for these days? Testing for variants that affect drug efficacy or that cause adverse events; I'll come to those. Testing for rare variants that are highly pathogenic and highly penetrant, like the BRCAs, familial hypercholesterolemia, G6PD, which I'll talk about, and Huntington's disease. And often here, as I said before, you sequence the implicated genes and then try to interpret the variants that you find. But you can't always. If you show up with a BRCA mutation that nobody's ever seen before, nobody knows; they can't say, yes, this is likely to cause disease. They would have to either do some biochemistry to get at that, or wait until they get more people with the same mutation and say: look, most of the people with this mutation did get breast cancer, so we think it's pathogenic. Companies are starting to move this into the direct-to-consumer market, and I'll show you some data from them as well. So here's a drug-efficacy example: a known set of variants in a gene called CYP2C19. One of these alleles is known to associate with reduced effect of a drug called clopidogrel.
And in fact, just about two weeks ago, the FDA announced a clearance for 23andMe to include this in their pharmacogenomics report. And the most important thing is that the labeling has been modified so that it doesn't need confirmatory testing. So now, if you were a customer of that company, you'd be able to use the genotype you got off of the array you purchased, potentially for ancestry or other things, as a basis for getting the dose of this drug set properly. And we'll see more of this going forward, as the FDA becomes more comfortable that people can take this data and move forward. The costs then drop very dramatically, as I alluded to earlier, and we can start to see how genetics can play out more. Here I'm showing you G6PD deficiency. This is X-linked, so on the X chromosome. If you have a variant in this gene, then basically you'll have a deficiency, and what happens is that with certain drugs, and if you eat fava beans, you can have a life-threatening reaction. And if you look down here at the bottom, hopefully people can see my mouse moving around a little bit, it's not very prominent in Europeans. I did a little bit of math to say, well, if we were looking at the U.S., there would be about 45,000 European-ancestry people in America at risk for this, which is not very many, and you can see why it doesn't make sense to test everybody for it. But as soon as you look at African-Americans, you see that it's about a 15% allele in the African-American population, and then testing is really valuable. So this really shows you that there's a wide range of risk-allele frequencies across populations. If we then go a little bit further, a new variant showed up this last month in 23andMe's report, and this variant actually is much more interesting to people who are Middle Eastern or South Asian.
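The back-of-the-envelope math described here is worth making explicit, because it's the whole argument for frequency-driven testing. G6PD deficiency is X-linked, so hemizygous males are affected, and the expected number of affected males is just population times allele frequency. The population sizes and the European allele frequency below are rough illustrative assumptions, not the exact figures from the talk; the 15% African-American frequency is the one quoted above.

```python
def expected_affected_males(male_population, allele_frequency):
    """Expected affected males for an X-linked variant (one copy suffices)."""
    return male_population * allele_frequency

# With a rare allele (hypothetical 0.05% frequency), testing an entire
# population of ~100 million males finds relatively few affected people...
europeans = expected_affected_males(100_000_000, 0.0005)

# ...but at the 15% allele frequency quoted for African-Americans, the same
# test identifies millions (assuming ~20 million males for illustration).
african_americans = expected_affected_males(20_000_000, 0.15)
```

Same test, same cost per person; the allele frequency alone flips the cost-benefit calculation.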
And the allele frequency in Europeans and Africans is very low, unlikely to benefit from widespread testing, but we now know that these data can be shared more broadly. And then there's a recent paper from Arjun Manrai's group at Harvard Medical School, where they looked at misdiagnosis for hypertrophic cardiomyopathy. The results section is kind of surprising: the mutations that were most common in the general population were significantly more common in black Americans than in white Americans. And we would not have misdiagnosed those people if even a small number of black Americans had been included in the control cohorts. Misclassification here was actually quite bad, because it resulted in treatment that the patient didn't actually need. So again, we do need to get better at getting genetics broadly across people. And what happened there is basically this linkage disequilibrium problem. In Europeans, we saw a variant, and people concluded that it must be the causal variant. But if they had looked at Africans, they would have seen that it couldn't have been, because one sanity check you can do is to say: if I have a rare disease, and you're going to tell me this SNP is causal for that rare disease, then the frequency of the SNP in the population has to be less than the frequency of the disease. And in that case, that didn't work, right? Africans had a very high frequency of that SNP, but they didn't have a higher frequency of the disease, certainly not sufficiently high for that SNP to be causal. All right, so let me skip to polygenic risk scores. These are basically weighted sums of the estimated effects of all the risk alleles; I'll show you a picture in a second and try to bring you along. Much of the focus here has been on individual diseases, as I said earlier. Once we have a thousand polygenic risk scores, you'll be able to see which diseases you're most at risk for.
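The frequency sanity check described above is simple enough to state as a one-liner: a fully penetrant causal variant cannot be more common than the disease it causes. The frequencies below are illustrative placeholders, not the paper's actual numbers.

```python
def could_be_causal(snp_frequency, disease_prevalence):
    """A fully penetrant causal variant can't exceed the disease prevalence."""
    return snp_frequency <= disease_prevalence

# A SNP at 10% frequency in some population can't be causing a disease
# seen in only 1 in 500 people there...
common_snp_rare_disease = could_be_causal(snp_frequency=0.10,
                                          disease_prevalence=0.002)
# ...while a rarer variant passes the check and remains a candidate.
rare_snp = could_be_causal(snp_frequency=0.001, disease_prevalence=0.002)
```

Running the check in every available population, not just Europeans, is precisely what would have caught the hypertrophic cardiomyopathy misclassification.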
And again, I think that's a useful fact to help people with, from a genetic perspective; that's not all of your risk, of course. So here on the left side is a Manhattan plot. This is basically the visualization of the test of association for variation at a particular locus in the genome, right across all of them, and in this case it's for depression. And you can see parts of the genome where larger values indicate stronger association. So there are some variants that are strongly associated, others that are weakly associated. And with a PRS, what you do is basically start over here at the far left of chromosome one and add up the odds ratios, the estimated effect sizes, for the risk allele across all of these things. And that gives you a way to score every individual in the population. So if you have a risk allele at that locus, you get the risk score, and if you have the protective allele, then you get a different score. And now, for every individual, we can just sum across the whole genome. In this case, I think it's about 40 million, or certainly 20 million, variants that are captured here, so we have a pretty good idea of what's going on. Sekar Kathiresan, at the Broad Institute of MIT and Harvard and now at Verve Therapeutics, is one of the folks that has really pushed this. And so now what I want to do is just pull up these sorts of pictures. This first histogram, a density plot: for each individual in a study population, you find out what the polygenic risk score is, and then you just plot them from the lowest score to the highest score. And then what you do, basically, is come here and say, well, for the really high scores, if I take all those people and go ask, did you actually get this disease, what fraction of people in that group get the disease relative to somebody who has a different risk?
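The summing-across-the-genome step just described is, at its core, a weighted sum: for each variant, multiply the per-allele effect size (e.g. the log odds ratio estimated by a GWAS) by how many copies of the risk allele the person carries (0, 1, or 2), and add it all up. The variant IDs, effect sizes, and genotypes below are invented for illustration.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages across variants."""
    return sum(effect_sizes[variant] * dosage
               for variant, dosage in dosages.items())

# Hypothetical effect sizes (log odds ratios) for three variants; a real PRS
# sums over millions of variants, many with imputed dosages.
effect_sizes = {"rs001": 0.12, "rs002": -0.05, "rs003": 0.30}

# One individual's risk-allele counts at each variant.
person = {"rs001": 2, "rs002": 1, "rs003": 0}
score = polygenic_risk_score(person, effect_sizes)  # 0.12*2 - 0.05*1 = 0.19
```

Protective alleles simply carry negative effect sizes, which is how carrying them lowers the total score.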
And if that's sufficiently high, then that opens the door for us to start to think about, well, maybe we should screen them differently, maybe we should treat them differently. And as I've tried to outline here on the far right, what you see is a pretty common plot: we just take the percentile of the risk score, and for each percentile we plot the number of cases for individuals with that score divided by the total number, so it's telling you essentially what the risk is in each of those bins. And you can see that as you get to high scores, the prevalence of cardiovascular disease in those people is very, very high, and so that tells us we can go back and do things. And then here I've just outlined that people are starting to look at how these perform in diverse populations. But our big problem is that we don't have that many African genomes, we don't have that many South Asian genomes, and as a result, we're not always able to extend these models the way that we want. In another paper from the same group, they basically demonstrate, and maybe I'll just focus on this image down here at the bottom.
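The percentile plot just described is straightforward to compute: rank everyone by score, cut the population into percentile bins, and take cases divided by total within each bin. The scores and case labels below are simulated purely for illustration, with disease probability rising with the score so that the expected shape appears.

```python
import math
import random

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(10_000)]
# Simulated disease status: probability increases with the score.
people = [(s, random.random() < 1 / (1 + math.exp(-(s - 2)))) for s in scores]

def prevalence_by_percentile(people, n_bins=100):
    """Per-bin disease prevalence, ordered from lowest to highest score."""
    ranked = sorted(people, key=lambda p: p[0])
    size = len(ranked) // n_bins
    bins = [ranked[i * size:(i + 1) * size] for i in range(n_bins)]
    return [sum(case for _, case in b) / len(b) for b in bins]

prevalence = prevalence_by_percentile(people)
# The top percentile shows far higher prevalence than the bottom one, which
# is the pattern that motivates differential screening.
```

In real cohorts the same computation is done with observed disease outcomes, and the top-percentile prevalence is what justifies treating those individuals differently.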
If you just look for monogenic drivers of risk, you find that for these genes it's about 1 in 211 people, and for these genes it's about 1 in 115. So if we just do the monogenic stuff, we're in the 1-in-100 to 1-in-200 range. As soon as we go to a polygenic risk score, we seem to be able to get ourselves to 1 in 5 that have a two-fold increased risk for cardiovascular disease. We'd like it to be three- or four-fold, so there's work to be done there, but you can see how this really changes the game quite substantially. It's great to look for individual drivers, and I think that's a worthwhile exercise, but it doesn't impact that many people. It's better to start adding in polygenic risk. Here we're really talking about genetic risk, but eventually we want to bring in phenotypes, because for many diseases, phenotype matters far more than genotype, and we want to make sure we're getting a good estimate of risk. But hopefully this gives you an idea of what the opportunities are in this particular space, why I think it's important, and why I think you'll see it coming into hospitals and treatment in the near future. Okay, so now what I've done is say: here's a whole bunch of things that you could do, and I have a couple more examples around single cell and some of the imaging data, but it's the same idea. We can generate large data sets; we know they're informative about the state of the patient; we think they might actually be informative about what treatment you should give that person. But how do we get there? Because the data themselves don't tell you what to do. We need algorithms, and we need to be able to get training data sets that are sufficiently large and sufficiently general, and that turns out to be really expensive. So in my time at, basically again, most of Harvard, the Hutch, Genentech,
etc., there are lots and lots of AI/ML solutions that come along all the time. In general, the problem is not finding a new algorithm; that is not what has slowed the field down at all. And it's not finding interesting problems; as I've already outlined, there's a ton of them that could be worked on. The biggest challenge is: how do we get large, well-annotated, clinically relevant data sets that can be used to train, test, and validate these models? That turns out to be the most challenging and the most expensive part of the whole operation, and so that's the place where I want to focus most of my efforts over the next few years. It's essential that you're working on a clinically relevant problem, and we'll talk a little bit about how to get there, with data that's sufficient to address that problem. And then, of course, if you actually want to use your algorithm to change the clinical diagnosis or the clinical path for somebody, well, you'd better make sure that you have a plan to get clinical validation. So here, what I've tried to outline in this little flow chart is the sort of standard, typical workflow. Identify the problem that you want to work on. Then find sufficient data, training and testing data, that you can use. Identify, usually, a set of AI/ML approaches, because the standard approach is really: let's try five or six different things, because we really don't know which one's going to work well. Then go through this iterative learning and performance optimization until you're finally happy, and now you've got model estimates that you like. You look at those and try to understand how they work, and there's some stuff to do in that. And then, if you do want to change clinical practice, you hit this clinical validation step, and you're going to have to say, well, how do I take all of this information and give it back to a doctor, or put it into a medical device, in a way that would
allow people to make decisions in real time? And then ultimately it gets used. What I've tried to convey, at least from my experience, is that identifying the problem, the AI/ML piece, and the fitting of the model are weeks-to-months operations, but the two red boxes can really be months to years. It can be very challenging to get a big enough data set, and as a result we often see published approaches where models were fit on data that were too small for us to be able to extend them. They give us hints that the idea works, but they don't get us to the place where you could do clinical validation. And the planning around the clinical validation step is often not included at all; people get very excited about doing everything up to 'fit the model and obtain parameter estimates,' because typically that's what gets published in the scientific literature. But there is this big step we need to overcome if we're really going to change how healthcare happens. In doing this you need a big multidisciplinary team: clinicians to ensure clinical relevance; pathologists to help you select the right cases; computational scientists and computer vision scientists to make sure that what you're trying to do is even remotely tractable, that you can get enough data, process it in finite time, and yield useful outputs. When you identify your approaches, again you need computational scientists with the experience and intuition to pick reasonable methods, to make sure you try complementary things, and to do all of the standard stuff the machine learning community has developed. The creation of these data sets is why I said this takes a long time; I'll show you at least one example at the end with what I think is a pretty interesting problem. Once you fit the model, you need data visualization experts, biostatisticians, and others who
help tell you what you learned and what it means; then clinical scientists who will take those outputs into a brand-new cohort that's never been seen before by anybody and say, yes, this seems to work in this cohort, and this is how well it works. And finally, after all of that, you still need a plan for a trial that will do clinical validation: how do you go out into real-world conditions and show that a clinician using the tool will get results similar to those obtained in the lab? Then you can develop this as a laboratory-developed test under CLIA or CAP rules.

All right, so that's that part. Let me spend five or six minutes talking a little bit about single-cell transcriptomics, again a fantastic opportunity with some hope of changing medical care, and certainly of understanding diseases. In oncology, the notion is that we're going to identify important subsets of the tumor with specific defects, and that will help guide treatment. In immunology, it's identifying subsets of the cells of the immune system, like NK cells or macrophages, that are not performing as anticipated. My own bets are on immunology right now. I think there's a lot of reason to believe that single-cell immunology, especially in perturbation-type experiments, is going to lead us to insights into how the immune system is dysregulated in disease, and that will be super valuable. In part that's because most of the cells of your immune system are quite happy being single cells; you basically don't have to do much to them to get them through these assays. We know a lot of the important chemokines and cytokines that you'd like to do the perturbation experiments on, and then you see what the readouts are. So I think that will be comparatively easy. I think there are still some challenges in oncology; I'll outline a couple of them on the next slide.
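The "try five or six approaches, iterate, pick by held-out performance" loop in the workflow above can be sketched in a few lines. This is an illustrative toy, not any pipeline from the talk: three deliberately simple classifiers (a majority-class baseline, nearest centroid, and 1-nearest-neighbor) are compared by k-fold cross-validation on synthetic two-class data, and the best held-out score decides which approach to carry forward.

```python
import random

random.seed(0)

# Synthetic two-class data: class 0 centered at (0, 0), class 1 at (2, 2).
data = [([random.gauss(2 * y, 1.0), random.gauss(2 * y, 1.0)], y)
        for y in (0, 1) for _ in range(50)]
random.shuffle(data)

def majority(train, x):
    # Baseline: always predict the most common training label.
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def nearest_centroid(train, x):
    # Predict the class whose mean point is closest to x.
    cents = {}
    for lab in {y for _, y in train}:
        pts = [p for p, y in train if y == lab]
        cents[lab] = [sum(c) / len(pts) for c in zip(*pts)]
    return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(cents[l], x)))

def one_nn(train, x):
    # Predict the label of the single closest training point.
    return min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

def cv_accuracy(model, data, k=5):
    # k-fold cross-validation: average held-out accuracy across folds.
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        test = folds[i]
        accs.append(sum(model(train, x) == y for x, y in test) / len(test))
    return sum(accs) / k

scores = {m.__name__: cv_accuracy(m, data) for m in (majority, nearest_centroid, one_nn)}
best = max(scores, key=scores.get)  # the approach we would carry forward
```

The point of the sketch is the structure, not the models: the baseline keeps you honest about what "working" means, and the held-out comparison is what tells you which of several complementary approaches to invest in, exactly the decision the red boxes then make expensive at clinical scale.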
And then neurobiology is really fascinating: how do we study the brain? There are defects associated with neurons, and defects in brain-specific cells such as astrocytes. Here it does seem that the field is going to have to move into a world where you combine sequencing data, imaging data, and so on, so that you get changes in behavior in the right context; the brain needs to be intact, and in the right environment, to study most of the real things that are going on. So neurobiology, I think, will be fascinating but harder, because we're going to have to marry at least two technologies to get it going.

Now, sampling. For all the statisticians: it's always a bit of an oddity to me. We're sampling data from something, so think of a tumor. Somebody's going to take a biopsy, look at the single cells in it, and say, okay, here's what's going on in the tumor. Hopefully most of you realize that would be remarkably like somebody going to Iowa, asking people what the most important crop is, and coming out saying we know everything about the United States and what its most important crop is. It just doesn't work that way. Statistics has a long tradition of developing good methods and models in survey sampling, and it's essential that some of those carry over. The same will be true for immune cells, and among these the really big challenges are that it's often the rare populations that are causing the defect, and rare populations are really hard both to sample and to identify. So I think there are challenges here that have to be overcome. I do think we have the statistical methods, but I haven't seen people put the two together quite the way they need to. And certainly in my experience tumors are remarkably diverse; they're generally not clonal; different parts of them evolve in
different ways, at least in many, many tumors. So I think this is a place where, yes, maybe, but there are challenges that won't easily get resolved if we don't address them head on. And then, as I think everybody knows, one of the issues here is that we only detect a small fraction of the expressed transcripts: ten to 15 percent, if you're lucky, of the mRNAs actually expressed in a single cell get detected. There are certainly goals out there to improve that, and it is an engineering problem, so I'm confident it will get better. Even within that, we still only get short transcripts; again, there are goals to get long, full-length transcripts, and those will be important, and we'll see them as they come along. And there are some other options; I'll show you one image at the end when I get there.

All right, here's a graphic from a paper, but the idea is pretty straightforward. We have, back in our database somewhere, a whole bunch of data that we took from healthy people: this is what's supposed to happen in healthy people. Maybe it's blood, maybe it's kidney, maybe it's a piece of vein so we can understand cardiovascular disease. We've studied the healthy folks, and that's in our library; maybe we've studied some people with a disease, and that's in our library too. Now we have a patient sample. If we can, in close to real time, or at least in weeks rather than months or years, do the sequencing and the profiling, can we start to understand what's wrong in this particular patient, what distinguishes them from healthy, and whether they look like other patients we've seen? Ultimately what we'd like to get to is: not only do we see what's wrong with them, but we see that this gene here looks up-regulated, and maybe that's the right drug to be using. That's one of the hopes; whether we get there or not will be determined in the future.
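The library-matching idea just described, comparing a new patient's profile against stored healthy and disease profiles and asking which it most resembles, reduces to a nearest-reference search. A minimal sketch, with made-up gene names and expression values purely for illustration (no real data or cohort is implied):

```python
# Toy reference library: average expression profiles (hypothetical genes/values).
library = {
    "healthy":   {"GENE_A": 5.0, "GENE_B": 2.0, "GENE_C": 1.0, "GENE_D": 4.0},
    "disease_1": {"GENE_A": 1.0, "GENE_B": 8.0, "GENE_C": 1.2, "GENE_D": 4.1},
    "disease_2": {"GENE_A": 5.1, "GENE_B": 2.2, "GENE_C": 9.0, "GENE_D": 0.5},
}
patient = {"GENE_A": 1.3, "GENE_B": 7.5, "GENE_C": 1.1, "GENE_D": 3.9}

def pearson(u, v):
    # Pearson correlation between two equal-length numeric vectors.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

genes = sorted(patient)
scores = {name: pearson([patient[g] for g in genes], [prof[g] for g in genes])
          for name, prof in library.items()}
best_match = max(scores, key=scores.get)  # most similar reference profile

# Flag genes that look up-regulated relative to healthy (candidate drug targets).
up = [g for g in genes if patient[g] > 2 * library["healthy"][g]]
```

In practice the library would hold thousands of profiles and the similarity measure would be far more careful (normalization, batch effects, uncertainty), but the shape of the computation, profile in, nearest known state and candidate up-regulated genes out, is exactly the hope described above.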
Then the same idea here; this is a slightly older paper, from 2017, but people are really well aware of this: an important hurdle is the pre-processing, analysis, and storage of the data. We're not going to be able to make therapeutic decisions, or anything else, if we don't solve that problem, so I think that's really a grand challenge for us. The way we do it today, with bulk RNA sequencing, is to find the average clone; if we see that they have lots of HER2 amplification, then we know how to treat that. With single cell, as I said, the hope is that you would look at individual cells and potentially think of a more precision approach: you'll not only have some cells with this mutation, but other cells with a different mutation, and that could ideally lead you to combination therapies. The challenge, as I said before, is that if you miss a small fraction of the cells that carry an even more important mutation, you might wipe out a lot of the tumor, but then it comes roaring back: you haven't taken away the parts you can control by drug, and there were other parts of the tumor with different mutations that were resistant, or that developed resistance. And then, as I alluded to earlier, there is the pre- and post-treatment story: if I sequence something pre-treatment, and then look at, say, circulating tumor cells post-treatment, can I understand what my treatment did? That again is really important if we can get it right, because it feeds back into the earlier steps and says: look, we now see that this drug has this effect, so we really want to use it in people that have a defect that aligns with it.

And then this one is a really cool one. One of my new colleagues at Harvard has been working on this project, where they basically take a microscopic image of a fixed tissue sample on a slide.
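Before moving on to the imaging project, the bulk-versus-single-cell contrast above is easy to demonstrate numerically. A toy sketch with invented numbers: a tumor that is 90 percent drug-sensitive cells and 10 percent resistant subclone looks uniformly "low" when bulk sequencing averages everything together, while per-cell readouts expose the rare resistant population.

```python
import random

random.seed(1)

# Toy tumor: per-cell expression of a hypothetical resistance gene.
# 90% sensitive cells (low expression), 10% resistant subclone (high expression).
cells = ([random.gauss(1.0, 0.3) for _ in range(900)] +
         [random.gauss(8.0, 0.5) for _ in range(100)])

# Bulk RNA-seq reports one number: the average over all cells.
# 0.9 * 1.0 + 0.1 * 8.0 puts it around 1.7, which reads as "low" overall.
bulk = sum(cells) / len(cells)

# A single-cell view can threshold per cell and expose the rare subclone
# that would survive a drug chosen from the bulk average alone.
resistant_fraction = sum(x > 4.0 for x in cells) / len(cells)
```

The 10 percent that the bulk average hides is precisely the fraction that can make the tumor come roaring back after treatment, which is why missing a small subpopulation matters so much.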
They have a mechanism, and there are many similar ones around as far as I'm aware, where you stain the slide with antibodies that have fluorescent tags conjugated to them. So you get a color for PCNA, a color for beta-catenin, a color for DNA, and you can figure out where the DNA is in the image, where this gene PCNA is expressed, and where you see beta-catenin. If you can do that for 10 or 15 different proteins (and that is what's being detected here: proteins, not mRNAs), then each round gets you four colors, so now you can imagine four different images overlaying each other. If I can align those, great: for any pixel I now have a notion of what color that pixel is and what colors its neighbors are. If I can go one step further and take some of the new image-processing algorithms and find all the cells in the image, then, and that's what you see in the middle two pictures, the image has been segmented down to the cellular level (I didn't do it; somebody else did), and we have an x,y position in the image for each cell. Now what we do is translate that into a level for each particular substance. So what you have for each cell is something like a 16-dimensional vector: for each location in the image, you know something about the cell that's there. Those are now just cell-based assays, like anything else, and you can do the t-SNE and UMAP plots of the cells, thinking of them in 16-dimensional space, identify subsets, and go back into the image and say: look, the cells that have these sorts of genes expressed are along the edge of these villi of some sort, these other ones are on the inside and look different, and these ones at the bottom again look
different. We can start to really apply computational methods. The challenge here is that going from a slide image to all the data I need is quite a substantial amount of work, and it's not clear how I'm going to do that for a thousand slides, or ten thousand slides. And, back to the questions I raised at the start: we had better make sure we know what question we want to answer, and that we had a pathologist choose the slides that are the most important ones for answering it, because we don't want to spend a week or a month of somebody's time generating data that is ultimately not going to be useful downstream. And then, as we and others do this all over the world, how do we put all these things back together into data-warehouse-type operations that are sufficiently well annotated that individuals can come and reuse the data for questions we didn't have in mind at the time?

One more paper, exactly the same story, just to give you another example; this is from Peter Campbell's group. Precision oncology for AML is basically going to need large knowledge banks of matched genomic and clinical data that support clinical decision-making. This is a problem we all need to start thinking about, because it's one of the real roadblocks between an idea and clinical change.

So, getting close to the end here. I hope I've shown you that there are some really interesting and potentially transformational approaches to delivering treatments and to helping people understand not just disease but wellness, and everything that affects humans as they go through life, so that they have as comfortable and productive a life as they can. I hope I've convinced you how important large, well-curated data sets are going to be for success, and that it's really essential
that we start to think, as computational scientists, about how we speed that up without sacrificing quality. And then, well, I didn't talk about R at all, but hopefully everybody saw that R, and Python, and things like that, at least in my mind, are really going to be the essential building blocks underneath all this for developing models. The one piece I didn't put up here, and probably should have, is that we really have to start paying more attention to data technologies. As these data sets get big, we can't just rely on everything somehow turning out okay: how fast can we get data off of disk, how fast can we get it to a CPU or a GPU? If it takes me five minutes to do one model fit in a study, then I'm limited in how many model fits I can do; if I can get that down to 15 seconds, or a millisecond, my ability to explore the range of models is pretty different. So at that point I shall stop. I haven't seen any interruptions, so hopefully I'm not just talking to myself.

No, no, not at all, and we have several questions. First, hopefully I don't get kicked out; can you hear me, can you see me now? Yes. All right, perfect. So, you've outlined a future, in the near future, maybe 20 years, where there's going to be a tremendous amount of scientific knowledge and technical know-how available, and this is going to impact systems like the practice of medicine and public health, complex systems that interact and are very slow to change. Can you give any advice on what the physicians listening here, or maybe the public health officials, need to be doing to prepare for this? How can they help change?

Yeah, I think it's about the places where people can help us in their specialty. If you're a clinician, then what's the problem? Are there problems that you can identify where you say,
'Hey, if we had a tool that solved this part of the diagnosis, or this part of my clinical care of somebody, I could see more people, or I could do better'? Or, 'We're using this and I don't think it works very well'? Clinicians should be able to really help us identify the opportunities, and then, as things come along, be willing to get involved in the clinical trial part: how do we get it out into practice? And for public health I think it's a very similar role: given what I know and what you know, what are the problems that this will help me solve?

Okay, so looking into the future again, in clinical practice and its place with artificial intelligence and machine learning: what's the outlook for specialties such as pathology and radiology once they provide enough cases for the models to make correct predictions? Do you think it will reduce the need for physicians in the workforce in that area?

So far that has never turned out to be true. When I get asked this question, I always tell a story from when I was an undergraduate in the math department. One week, all of the women who were employed as typists, and at that time it was, I think, exclusively women, were very unhappy, because there was this great new thing called a word processor that had come along, and suddenly, instead of four people typing manuscripts, we were down to one person typing manuscripts. So they were very worried about their jobs, and everybody knows what happened: we didn't reduce the workforce, we just changed what the workforce did. Mathematicians used to never revise their papers; they would use white-out and hand-write in a little bit. But once the typed version was easy to change, what they did then was they
basically made real revisions. And the same thing, I believe, is going to happen with pathology. We're going to take the sort of boring, common cases, and if I were a pathologist and things came along that could be read by a machine, personally I would be very happy to have them read by the machine so that I could spend my time on the problems that are really hard. That's how you get innovation into medicine: you take away the problems that are easy and put your skilled workforce on the problems that are hard.

So how do you think you could persuade MDs to act on an algorithm without a randomized controlled trial versus standard of care?

I'm not sure I want to. I'm a pretty big fan of randomized controlled trials; I think that's how we've managed to push things forward, and I really haven't seen any examples that have convinced me there are better ways. There are better ways to run clinical trials, that I will admit, but randomized controlled trials are, I think, an essential piece going forward.

Do you think that if wearable IoT medical sensor data were scored to a sufficient-statistic data stream, the data reduction would make wearable sensors useful in gathering environmental data?

Well, they could. There was a bunch of work I'd seen, must be almost 10 years ago now, on these silicone bands that people could wear that just picked up environmental exposures; at some regular interval, every three months or so, you took the band off and sent it in to be run through a mass spec. So if you want that kind of data, it exists. What we found at 23andMe when I was there is that just using zip codes and things like that for where people live gives you a reasonable idea of exposure. I think some of the concern around these wearables
is about precision, but we're not trying to get exact values all the time. Sometimes you just need to be reasonably close, so that when you average it out over a decade you have a number that's interpretable, and you don't have to worry about whether you got exactly the right number of steps that you walked today. The big changes are not going to be about that; they're going to be about whether you suddenly change how active you are by some amount. That's a big change, and maybe that associates with a healthier lifestyle. And, at least with the Apple Watch, and folks can see I have one, when you do that you also get heart rate, and if changing your walking changes your heart rate, we get these two readouts at once. If I want to know how fit somebody is, I keep saying the best thing in the world would be to make them climb 10 flights of stairs, measure their heart rate at the start, measure it at the end, and see how long it takes to go back to where it was at the start. That's going to be better than anything else, and the watch doesn't need to be that accurate to get it about right. I think we're at times oversensitive to accuracy.

How do you think we can tackle the problem of well-curated data? Do you think R could be a driver in data engineering?

I don't know about R specifically. The way we tackle it is getting ontologies, getting people to come close to agreeing on a set of words to apply to things, and on which things are synonyms of each other. Again, let's not try to say here's the one way of describing everything; let's try to get reasonably close, so that individuals can find the things that are close to what they want, and if they then need to recreate them individually, with their own pathologist next to them, that's what they should do. But it's very hard; I don't think you could
come up with a scheme where all the clinical pathologists in the world have to agree that this slide has this evidence for this disease. They just won't, and not because any of them are wrong; it's just not how they think about things. So it's better to have reasonable ways of annotating, and then ways of mapping between the annotations.

Do you know if anyone has tried to present pharmacogenetic data in the EMR at the point of care to MDs prescribing specific meds? The person asking says it seems likely to be a huge interface problem, as geneticists and statisticians use very different jargon, and MDs want prospective data, preferably RCTs. Is the quality of the data enough to affect care?

Well, certainly. You're asking me two things, so hopefully we're on the same page. In the one case, for something like G6PD, which I showed earlier: if you have a mutation there, there is a pretty standard way of pointing that out, and it would be good for you to know you had that mutation, because there are certain drugs you really shouldn't take and certain foods you really shouldn't eat, and if you do it can be life-threatening. That's pretty straightforward. For a lot of the CYP genes and other genes that we know metabolize drugs quite differently, what happens today is that if a doctor wants to prescribe those, they generally have two approaches: they start low and titrate up, or they get you tested and then try to narrow in on the dosage that way. So there is a standard for reporting genetic evidence; I don't think that's a big problem. Around PRSs it will be more interesting, and that's largely the problem of how we convey to a doctor the right idea of risk. That's not going to come from statisticians; it's going to come from people who do UX design and understand how to write words that help people understand risk, and through changes in training in medical schools, so
the clinicians are able to understand what's being told to them about genetics and genetic risk.

To close out here: do you have any thoughts about the ethical considerations that go along with all the technology and knowledge you're anticipating?

Yeah, you have to pay a great deal of attention. You need people who are consented for the work you want to do. Again, my experience at 23andMe was that most people who have diseases want to be involved in making those better, and there's virtually no disease that is handled well for every human; there's a lot of room for improvement, and the population wants to be more involved than it is in medicine. So I think there are huge opportunities there, but the ethics are the ones of making sure you have consent. I also think there are ethical issues that are right now not being well observed: we don't do enough on diversity. We have focused our attention too much on European populations, which is sort of where the genetics world has ended up, in part because of access to samples and data. It's important that we start to branch out and make sure we cover a broader range of the population, and that we bring in good statistical methods to make sure that what we're saying is applicable to the individuals we intend to treat. And, back to the piece at the very start: for reasons I don't understand, machine learning algorithms seem to get set up such that no matter what the input is, they're going to give you an answer. Folks who have read Brian Ripley's book on this know that one of the best things you can do is to say, 'I never see a data point like that, so I'm not going to make a prediction.'

Well, thank you, Robert. Our time has come to an
end. You've covered an extraordinary amount of material, and I'm sure we're going to be thinking about it all the way to the next R/Medicine conference. We're very fortunate to have had you here, and we wish you the best in your new position. Thank you. Thanks. So now, I guess, there's a break on the agenda.