Live from Boston, Massachusetts. Extracting the signal from the noise. It's theCUBE, covering HP Big Data Conference 2015. Brought to you by HP Software. Now your hosts, John Furrier and Dave Vellante. Okay, welcome back everyone. We are live in Boston, Massachusetts. This is SiliconANGLE's theCUBE, our flagship program, where we go out to the events and extract the signal from the noise. I'm John Furrier, with my co-host Dave Vellante. We're here in Boston live at HP's Big Data Conference. Hashtag HP Big Data 2015. Join the conversation, go to crowdchat.net slash HP Big Data 2015. Ask any questions, I'm happy to answer them. Our next guest is Toby Bloom, PhD, Deputy Scientific Director, Informatics, New York Genome Center. Welcome to theCUBE. Thank you. So tell us. When I hear that, I hear science, I hear data, and with cloud computing, we've been covering a lot of cool things around how people can get all this compute and really move the needle in terms of discovery. So, thoughts on that? What's the state of the union on this whole trend? Well, genomics certainly is big data. We're probably nowhere near as big as Google in terms of collecting data, but I collect a fair amount. I collect about 12 terabytes of raw data a day off sequencing instruments and probably store 30 to 40 terabytes a day. So there's a lot of data there, and there's a lot of complexity in the analysis and a lot we need to do to be able to find the causes, the molecular causes of disease, and hopefully cures for them, and have a real impact on medicine. So what's your approach on a day-to-day basis? Take us through a day in the life with all this new opportunity, because you're storing more and more data and you're certainly ingesting plenty of it. You've got a lot of data. What's going on? How do you get to all the insights? How do you sort through it? It's a long, complicated process.
What we're getting off the sequencing instruments is billions of short strings of As, Cs, Ts and Gs, the four letters of DNA. And we do a lot of processing on that to figure out where each of those strings actually fits in the human genome and try to find the variants, the things that are different in any single person's genome from what the reference genome is. And once we find those variants, the job is to figure out which ones are related to disease and which ones aren't. And for common diseases, that can be lots of variants and lots of combinations. We have a huge statistical problem. There is a true answer out there, but the data is all probabilistic. And so we're trying to get the best answers we can and improving the algorithms for working on it. The analytics are very complex and we do a lot to continuously improve the analytics as we work on the data. And then we're putting all of the data into one big database, an HP Vertica database, actually, to make it available to scientists to ask the questions they need to do the next steps in analytics. So what's the big overriding goal of the organization? The overriding goal is to find the molecular basis of disease and improve healthcare, improve medicine, and do that now. The Genome Center was formed as a collaboration, initially of 12 medical centers, and it's up to about 17 members now. But we work closely with researchers at the hospitals, with clinicians at the hospitals, to have an impact on patient care. So talk a little bit more about what I conceptually see as your data pipeline and how that's evolved over the last few years. That's a long question. Short question, long answer. Well, okay, short question, long answer. Let me think for a second about what you really want to know about what we do. So if you could paint a picture of sort of how that process actually works and what's involved. So the process itself starts from DNA in a tube, okay?
And that DNA gets cut up into little pieces and sequenced on a sequencing machine after a lot of lab steps I'm skipping, okay? And then our first step is to take all those short sequences and basically pattern match them back to a standard genome sequence we use, a reference genome. In cancer, we do that with the DNA from the tumor and then with the patient's blood and we compare them to figure out what's changed in the tumor. So we're matching, in that case, we're not matching against a standard reference, we're actually matching against the person's own blood sample. We find all of the places, we do this big pattern matching and then when we're finished and everything's placed correctly, we figure out what the differences are. Okay, now, any average person probably has one to three million differences from the reference. Okay. You have to make sense out of that. We have to make sense out of that. A lot of that is very standard, a lot of them are common. We have to figure out which ones are really related to the disease we're looking at and there's a lot of biological analysis that goes into that. There's a lot of analytics, a lot of machine learning algorithms, a lot of statistics to figure out how likely it is that this is a true variant and not a mistake in the sequencing process. And then when we get down to the variants that we think make sense and are most likely the drivers of the disease, then we need to understand the biological process that relates that to the disease. How the symptoms you see are related to that. And then hopefully we can find drugs that will affect the way that variant causes the disease or the progression of the disease and we can interfere with that process and find treatments. 
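The steps described above (place short reads against a reference, count what was seen at each position, report the differences, then compare tumor against blood) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the New York Genome Center's pipeline: the function names, the 16-letter reference, the mismatch limit and the depth and fraction thresholds are all hypothetical, and production pipelines use indexed aligners and statistical variant callers rather than the brute-force scan shown here.

```python
from collections import Counter, defaultdict


def best_position(read, reference, max_mismatches=1):
    """Brute-force placement: offset with the fewest mismatches, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]


def pileup(reads, reference):
    """Count the bases observed at each reference position."""
    counts = defaultdict(Counter)
    for read in reads:
        pos = best_position(read, reference)
        if pos is None:
            continue  # read could not be placed; skip it
        for offset, base in enumerate(read):
            counts[pos + offset][base] += 1
    return counts


def call_variants(reads, reference, min_depth=2, min_fraction=0.8):
    """Report positions where well-supported reads disagree with the reference."""
    variants = {}
    for pos, bases in pileup(reads, reference).items():
        depth = sum(bases.values())
        base, seen = bases.most_common(1)[0]
        if depth >= min_depth and seen / depth >= min_fraction and base != reference[pos]:
            variants[pos] = (reference[pos], base)
    return variants


def tumor_only_variants(tumor_reads, blood_reads, reference):
    """Keep variants seen in the tumor but not in the patient's own blood."""
    tumor = call_variants(tumor_reads, reference)
    blood = call_variants(blood_reads, reference)
    return {pos: v for pos, v in tumor.items() if blood.get(pos) != v}


# Hypothetical data: the patient differs from the reference at position 3
# (inherited, present in blood and tumor); the tumor also differs at position 10.
REFERENCE = "ACGTTGCAGGCTAACT"
BLOOD_READS = ["ACGCTGCA", "GCTGCAGG", "TGCAGGCT", "CAGGCTAA", "GGCTAACT"]
TUMOR_READS = ["ACGCTGCA", "GCTGCAGG", "TGCAGGTT", "CAGGTTAA", "GGTTAACT"]

if __name__ == "__main__":
    print(call_variants(BLOOD_READS, REFERENCE))
    print(tumor_only_variants(TUMOR_READS, BLOOD_READS, REFERENCE))
```

The depth and fraction thresholds are a crude stand-in for the statistics mentioned in the interview: a difference seen in only one read is more likely a sequencing mistake than a true variant, so the sketch ignores it.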
The Genome Center doesn't find the treatments, although we do do some work to figure out how best to match existing treatments to a case: if you have a tumor that's being driven by this mutation, which of the drugs that are out there might work, whether they are approved for this particular disease or for some other kind of cancer. So you're vetting things that you can offer. Right, what's the best thing to do for this patient? Got it, okay. And we do do that. And the process that you just described, is it sort of a series of tools that you grab from the toolkit, or is it more of an integrated, strict, rigorous process? It's a set of tools we grab from a toolkit primarily, and then we try to improve the tools in the toolkit. But it's not one single integrated process. In many cases we use any number of tools that are supposed to do the same thing and compare the results and try to figure out how to put the results together and find a better result. So it's very much an organic process that's constantly changing? It's a constantly upgrading process. The thing to remember is it's a statistical process. There are errors in the chemical process that mean that there might just be errors in the DNA results we're looking at. And there is variation that occurs at all points in the process. And so we're trying to analyze the data the best we can to be as sure as we can statistically that we have the right answer. There is some truth out there, but we don't know what it is. There's some truth. What truth is it? It's not a probabilistic process in the sense that the answers are probabilistic and you get some heads and some tails. There's really one answer out there and we just don't know what it is. And our job is to constantly improve the process to get closer and closer to the right answer. The data is always living then. If that's the case then- The data is, yes, we keep the raw data, and if we improve the processes we go back to the raw data. That's fascinating.
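The idea of running several tools that are supposed to do the same thing and putting their results together can be illustrated with a simple vote. This is only a sketch of one possible way to combine results, assuming each tool reports variants as (chromosome, position, reference base, alternate base) tuples; the tool names, the example calls and the `min_support` threshold are hypothetical, and the interview does not say how the Genome Center actually merges results.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_support=2):
    """Keep variants reported by at least `min_support` tools.

    `calls_by_tool` maps a tool name to the variants it reported.
    Returns a dict of variant -> number of tools that reported it.
    """
    support = Counter()
    for calls in calls_by_tool.values():
        support.update(set(calls))  # one vote per tool, even if it repeats a call
    return {variant: votes for variant, votes in support.items() if votes >= min_support}


# Hypothetical output from three hypothetical callers.
EXAMPLE_CALLS = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr3", 7, "G", "A"), ("chr2", 50, "C", "T")],
}

if __name__ == "__main__":
    for variant, votes in sorted(consensus_calls(EXAMPLE_CALLS).items()):
        print(variant, votes)
```

A vote like this trades sensitivity for confidence: raising `min_support` drops calls that only one tool makes, which removes some tool-specific errors but can also drop true variants that only one tool is able to detect.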
I want to just ask some personal questions around people, because we always talk about people, process, technology, but more importantly, the younger generation. I have a 20-year-old son, he's my oldest, and my youngest is 12. And there's a generation that's fascinated with this tech, from clean tech to some of the stuff you're working on. What advice would you share with folks out there who really want to apply themselves to this, whether it's physics, math, or stats? How do you advise the young kids in this generation? Because there's a real energy to work on these awesome problems and new opportunities. Well, first of all, my first advice to any kid is to go with your passion. But these are problems that I'm passionate about. I mean, you can probably tell by just listening to me. The idea of making major improvements in human health, not just in cancer but in finding the causes of Alzheimer's disease or diabetes or autism or things that cause such pain in so many people's lives, both the patients and their families. And we have a way of approaching it now that isn't new, but it's certainly different and it's way more powerful. And so if you're interested in that, go for it. And you want to learn a lot of math and you want to learn a lot of biology. People are going to hate me for this, but the math is probably more important than the biology. I will hire statisticians who know nothing about genomics. Hit the books. Okay, we can teach them the genomics as long as we have somebody who understands the statistics and whether we're getting the right answers. Whether we're really going to help people and how to make that as accurate as possible. Yeah, Mike Stonebraker yesterday was commenting on that one point. Data analyst, data science, really it's about stats.
He was really hammering home statistics and using the probabilistic mentality. And the more data we have, and the more we can bring data together from the same disease, from different diseases, from healthy people, the better statistics we can get. I mean, you're living- So I want more data. You're living a use case that I'm so passionate about, because I love health. I wish I could do more. I'm not a big math whiz; I was okay, but I wasn't PhD level. But the internet of things: really, you are a working use case of the internet of things. You've got a lot of instrumentation, and we're seeing devices now on people, in hospitals and in general life, that are pouring data into the system. So this is really the future. Your path is the vector. We are expecting to integrate all of that data. We are expecting to take data streaming off devices and we're expecting to take clinical data and medical records and eventually be able to integrate that with all the genomic data. It's essential, especially for these very common complex diseases, to have all of that data. You were saying you want more data. You just said that. And I want more data. This morning we heard more data means more complexity. It doesn't necessarily mean better answers. So what are your thoughts on that and how do you address that challenge? From my point of view, more data used correctly means better answers. Okay, and it means better health. Yeah, and with Moore's law and cloud computing, compute is getting more abundant. So supercomputer capabilities can be on the head of a pin. I'm not worried about that. We will get enough compute, if we can get enough data. But really lots of different kinds of data that we weren't thinking of integrating before.
We have one project now, a rheumatoid arthritis project, where we actually are putting apps on people's smartphones because we want to know if there's a change. Rheumatoid arthritis flares every once in a while, and in between there are far fewer symptoms. And we want to know if we can tell by looking at RNA data when the spike is going to happen before it happens. If we can, that will help us figure out the mechanism of disease and how autoimmune diseases work. But if we can correlate that with patient movement changes before the spike, we can tell them to go to the doctor faster and get the drugs they need to contain this flare before it happens. Yeah, I mean, the other thing I want to bring up is obviously crowdsourcing, which is a big part of the world now. You see Kickstarter, we've got the CrowdChat going on. Lots of interactions now. We're a networked society, we're connected, and everything can be instrumented, including potentially our brains at some point, apparently with an embedded chip there. How do you talk to people who might be afraid? I mean, I was just talking with my family members about 23andMe, which I signed up for. I want the data, I'm a curious person, I'll take that risk, but they were nervous. How do you get people to embrace this idea that if we collect more data and participate, it's better outcomes for everyone? So I think everybody has family members who are affected by some illness, whether it's cancer or cardiovascular disease, or somebody they know with ALS, or somebody they know with a child with autism. To solve those problems, to find medical answers to those diseases, we need everybody's genomes. And you're helping yourself and your family with what might happen in the future, and you're helping everybody else at the same time. And there are privacy issues, there are some risks and we should all learn about them, but for the most part, donating your genome to medical research is not a big risk.
There's a lot of anonymized data and it's pretty much, no one's going to come in and say, you know. The thing to understand is that your genome is like a fingerprint. It's yours alone and it's unique to you. If someone has something to compare it to, you can be re-identified. If they don't have anything to compare it to, you can't. But people use fingerprints all the time to find people. So we don't know, right? So do you have thoughts on public policy and how that should be shaped to effect better outcomes? I can give you personal opinions. We love personal opinions. Okay, personal opinions, not to be associated with the New York Genome Center. The first thing I would like to see is around GINA, the Genetic Information Nondiscrimination Act, which is the law that protects people from discrimination in jobs and in medical care based on their genomic data. Okay, it is limited to those two areas. I would like to see it expanded so that if people donate their genomes, they can't be discriminated against in life insurance or disability insurance or long-term care insurance. And not just them, but their families, right? So for me, I actually have not had my whole genome sequenced yet. Although I certainly would if I was sick now, and I will soon anyway, because it's now possible. What's 23andMe? That's not a genome, that's just... 23andMe is not a whole genome. It's a number of genes. So I submitted my DNA, and I've gotten some results back. There's other information. I've been waiting until I can do my whole genome because I want all of the information. Genes only make up about 1% of the genome. How do you do a genome? Do you walk down to the local genome store? Is it only New York? I mean, are there places or facilities for that? Or is it special? Okay, so the New York Genome Center has received approval to sequence anybody who volunteers to give their genome to medical research, but we don't have the money to pay for it.
So we're asking people to donate their genomes and pay for the sequencing. And so, yes, we are capable of doing that. What? How much does it cost? About $2,000 right now. Okay, so the uptake on the general consumer market is not going to be there. I think if you buy an iWatch, you should get a genome. You know, with the Precision Medicine Initiative, we're going to be able to do, we hope, a large number of those without cost to the individual. But it's not cheap to do yet. I can understand that. And we'd like that information. There are projects that are disease specific where we have the funding to do it. Got it. So not very long ago, a decade, maybe even less, you had a lot less data and it was much more expensive to process. You had to do a lot of sampling. How has this ability to manage or analyze this data deluge affected medicine? Can you give us some examples or some hopeful signs? There are certainly hopeful signs there. There is now a drug for some percentage of cystic fibrosis patients that helps them live a normal life. I mean, there's a lot of kids out there who would have been unlikely to live past the age of 20 or so, who are going to be okay because of genomics. There's any number of cancer drugs that have been developed based on genomics and a paired diagnostic that says, if you have this mutation, this drug will work. There's lots of anecdotes in the market. My understanding in the case of cancer, and my inference is that data has played a role in this, is that the thinking on the way cancer is treated is changing. As opposed to sort of, let's cure cancer, it's more like HIV: let's see if we can live with this disease. I think there are a lot of cancer centers now who are thinking that they really should sequence every patient who comes through the door. We're doing a study right now on glioblastoma, which is a very aggressive brain cancer. Most patients don't live more than 12 to 14 months after diagnosis.
We're doing a study to do whole genome sequencing on their tumors and on their blood and comparing, and some RNA sequencing, and trying to figure out for each individual patient what drugs out there are most likely to have the best effect on these particular sets of mutations, and figuring out if there are combinations of drugs that we can recommend. This is a research project. All of their physicians are enrolled in this project. We go to a tumor board to get our recommendations approved, and then we go to their physician and say, this is what we'd recommend. It's up to you and the patient to decide whether to take this chance or not. So we're starting to see effects, but your community must be very excited about the potential over the next two decades, three decades. We're very excited, and hopefully within the decade we're going to see major progress. You've seen drugs for melanoma that have amazing effects. Right now those effects often don't last all that long, but they do, right? So there are a lot of people who will tell you that cancer is going to become a chronic disease before it gets cured, right? We'll go from one drug to another drug to another drug. I'm not a biologist or a physician and probably not capable of giving that answer myself. I've got to ask, I've got two minutes left. I want to ask, how do people get involved? Because I mean, it seems like a no-brainer to me that there'd be support for this. You see the ice bucket challenge is coming back for a second year. Boatloads of money are being raised. Some are always questioning where it goes. How do people get involved? Why is it taking so long for people to kind of come to their senses on this? And two, what can people do to get involved? Share some information on that. Hard question, okay.
We are starting to do some projects where we're crowdsourcing and asking people to volunteer, without having to pay, to at least do some genomic work on large populations. New York's a very diverse city, and just having a cross-section. Uber has more data than you guys. That's ridiculous. You can do it the other way around. Uber can't possibly have more data than me. Can they really? Well, they rewrite them now. I will get to that. That's confidential information. The other thing that I think is holding things back right now is regulations. And I very much believe we do need regulations around this and we need to protect patients. I think some of the regulations were written 10 and 20 and 30 years ago, before the internet. Before genomic medicine. Before the notions of privacy, and what was important and what wasn't to keep private, changed. And so there are lots of limitations. There are policies about what data I can combine and what data I can't combine, and who can see the data I have and who can't see the data I have. And I think it will get easier when we can say we just want to consent patients to use their data for all medical research, and we have some way of getting at the data we already have to use for all medical research. It will make it much easier to have an effect. So policy challenges and funding are really the two things, right, that you're balancing. Yes, that and funding, and we're hoping the funding will come, but yes. And so what should people do? If you had to share with folks who are watching: in Silicon Valley, where I live, everyone wants to change the world, building apps to ship food to someone's house or to hail a cab or to do ad technology, serve better ads. But to make the world a better place, what can people do?
Well, if you're talking about Silicon Valley, you can certainly ask people to donate money to do genomic studies on the disease they care about most. Yeah. I know Sergey Brin is passionate about this topic. And they should donate their genomes to research. And I don't know exactly how else I want to tell people to get involved. Learn the math and become computational biologists and help us deal with all the data. I think that there's a post-9/11 generation out there that really is different than the generation before, in that they really want clean energy. You have electric cars. A lot of really good science going on right now. Help us, help us do the science. We need more people doing the science. I desperately need people doing the science. Toby, thanks so much for spending some time with us. This was an amazing conversation. Super important and really geeky, but it also changes the world as well. Thanks so much. Thank you. Thank you for having me. We're live in Boston. It's theCUBE. Sharing our genome with you. It's all about the data. More info and data. We'll be right back after this short break.