Dave: Jonathan Gluck from the Heritage Provider Network. So, Jonathan, let's start with you. What is the Heritage Provider Network, and what did you guys announce today? Can you get real close to the mic so we can hear you, and block out all this crazy background noise? And welcome to theCUBE, first of all. Thanks for coming in.

Jonathan: Thanks for having us. The Heritage Provider Network is a physicians group based in Southern California. We provide healthcare for approximately 700,000 members, making us one of the largest groups in California and the largest by geographic scope. That's really our nuts-and-bolts business: providing healthcare for our membership.

Dave: So you announced a prize today, right? A competition. What is that all about?

Jonathan: We have decided to launch a $3 million prize for a team that creates an algorithm that predicts the hospitalization of the members of a given patient population. The theory is that if we can create an algorithm that predicts hospitalization, we can then talk to that patient's doctors, who can attempt to provide care that keeps the patient from needing to be hospitalized. Unnecessary hospitalizations are both a drain on resources and simply not good for the individual. Hospitals have their place when care is necessary, but they are also dangerous places to be if you don't need to be there.

Dave: So we're talking about an application of data science that actually has social implications, implications in terms of making lives better. Now, in the past we've talked about predicting traffic patterns and things like that, which are relatively trivial compared to saving lives. So Jeremy, talk a little bit about Kaggle, what it is you guys do, and where you fit into this whole competition.

Jeremy: Well, hey, don't knock traffic patterns. Nobody wants to be late to work. But yeah, sure, there are even more important ways we can use data.
Jeremy: So one of the interesting things about this question of how you actually leverage data is that it often takes some pretty serious number-crunching algorithms, right? And it's really hard to work out who's the guy with the number-crunching algorithm I can use. I could pay enormous amounts of money to big-name guys, they could come up with an algorithm, and then I don't even know if it's any good. So what we do at Kaggle is run data mining competitions. People put their data up on the web, and our competitors just click a button, download it, and do whatever the hell they like with it to come up with an answer. It's data mining as a sport. They put that answer back up on the web just by clicking another button, and they appear with a score on the leaderboard. So there are teams, there are scores, there are competitors. And this approach of running competitions, we're finding, and Netflix found as well, is a really great way to get the most out of data. So that's why Kaggle is running this HPN Prize.

Dave: Now, we were talking off camera. This isn't so-called big data, is it? It's pretty small data.

Jeremy: Yeah, well, maybe it's medium-sized data, Dave, I don't know. To me, big data gets big when you need multiple machines to look after it. This data set easily fits into memory, very easily fits into memory. And I've competed in a lot of these competitions. I can tell you that the algorithms I see, or that I use, that do well are generally ones that I run on my laptop, normally using free open-source software. If I think back to 20 years ago when I started in the world of data science, we needed big multi-machine setups to run what today we'd think of as simple 100,000-row problems. But today we can run these problems on our laptops. And HPN's data set is a few million records once it's all put together, easily small enough to fit in pretty much any software you can think of.
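[Editor's note: a rough back-of-the-envelope check of Jeremy's "fits in memory" point, in standard-library Python. The record layout and sizes below are illustrative assumptions, not the actual HPN data, which had not been released at the time of this interview.]

```python
from array import array

# Hypothetical layout: one row per member-year, ten numeric fields each
# (age, claim counts, prior hospital days, and so on).
n_records = 3_000_000                    # "a few million records"
fields_per_record = 10
bytes_per_double = array("d").itemsize   # 8 bytes on typical platforms

# Total footprint if every field is stored as an 8-byte float.
total_gb = n_records * fields_per_record * bytes_per_double / 1e9
print(f"~{total_gb:.2f} GB uncompressed")
```

Even stored naively as doubles, a few million records come to a fraction of a gigabyte, well inside the RAM of a 2011-era laptop, which is Jeremy's argument for prototyping locally with open-source tools rather than on a cluster.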
Dave: So Jonathan, we talk a lot these days about health care costs. Obviously the new health care bill passed, and there's a lot of finger-pointing going on. The insurance companies point at the doctors; the doctors say, hey, you've got to have accountability, and patients need to get more involved in decision-making. Is HPN, through data and data science, actually attacking that problem? Is that part of what we're seeing here?

Jonathan: Yes, it is. What we believe firmly at HPN is that we need to move away from what we have today, which is a sick care system, meaning we provide most health care to individuals after they become ill. We attack the problem after the horse is out of the barn. We need to move to a system that focuses on keeping people healthy. When you think about it, that's really the only way we as a society can provide for all of the health care needs that people have. We simply do not have the resources as a society to attack the problem after people have become sick. So we think of this prize as a large step in that direction: trying to predict who is going to become sick so we can focus on keeping them healthy, before they become sick and need to use those scarce resources.

Dave: So how does the prize work? Is it winner-take-all? There are some alpha geeks out there who are, I think, chomping at the bit to get involved.

Jonathan: The prize is going to have two components. The first: the team that creates an algorithm predicting a given percentage of hospitalizations, and that final figure is yet to be set, though the rules will obviously be out before the April 4th launch date, will win the $3 million prize. In addition, we intend to have milestone prizes along the way, meaning the leading team after, presumably, six months, one year, a year and a half, however long it takes for someone to win the prize, will also win a significant sum of money.

Dave: So the duration is open-ended?

Jonathan: The duration is open-ended.
Jonathan: Someone will need to come up with an algorithm that meets the minimum requirement, and when you come up with that algorithm, you win the prize.

Dave: So the first one over the finish line wins.

Jonathan: First one over the finish line wins.

Jeremy: Dave, can I give some suggestions to the alpha geeks?

Dave: Yeah, please, absolutely.

Jeremy: I mean, I love competing in these competitions, and I know a lot of people who say to me, Jeremy, I've so much been thinking about competing in one of your competitions, and I say, why don't you? And it's kind of like, oh, I don't know, I haven't got time, or I'm worried I might not do very well. There are all kinds of excuses, right? My advice to people is to start getting involved in competing right now. This huge $3 million prize is going to be around for a while. It's not easy. If you can imagine working out who's going to end up in hospital and who's not, it's going to require lots of great model building, looking at everything from drug effectiveness, to numbers of interactions, to people's characteristics, and so forth. So I really think people need to start actually getting involved in competitions. They can go to the heritagehealthprize.com website and register now. Then, on April the 4th, they can download the data set. Pretty straightforward. But in the meantime, go along to Kaggle.com, have a look at one of the other competitions, and start learning how to compete. Meet some people out there, build some teams, find out how this all works. I think it'd be a really great experience. Maybe try it out on some other competitions and sharpen your skill set. And look, if you give it a go and you come last the first time you put a submission in, that's okay. Think about it overnight, come back the next day, try something a bit better. You'll keep moving up the leaderboard, and you'll have a lot of fun doing it.

Dave: Now, that data set is a fixed data set, right? At time zero, whatever, right now, I guess? Or is it April 4th, you said?
Jonathan: April 4th will be the launch date.

Dave: Okay, so the data set will be fixed at that point in time. Is that right?

Jonathan: Yes, it will be. There won't be any changes made to the data set.

Dave: Okay. So, Jeremy, that's good advice for people out there. So where do they find these resources?

Jeremy: Kaggle.com is the best place to start; go to Kaggle.com and check out some of the other competitions. They all have basically the same kind of framework. There's always a file that you download which has information about outcomes in it; in the health prize example, it'll be hospitalization outcomes. There'll be another file that contains all of the same information, but without the outcomes included. That's the one where you've got to send back a column of numbers saying, this is what I reckon the outcomes will be. Then, on the Kaggle website, we split the results into two pieces. One is used to build the public leaderboard, where you see how you're going. The other piece we keep separate, hidden from everybody until the last day of the competition, to double-check that the leaderboard result can actually be applied to a whole new, fresh piece of data as well. So there's this standard framework that we use that works really well, and you can try using it now.

Dave: All right, so definitely go check that out, for those of you who are inclined to attempt to go win $3 million. That's a serious chunk of change. And that URL is Kaggle.com. Mark, in case you want to bring it up and show people. As you can see, guys, we've got a lot of people watching here online, so this is obviously an interesting segment. So while I have you two, I wonder if we can talk about the conference here. We're at Strata, the big data conference, right in the heart of Silicon Valley, and there are a lot of people here. Well, let me just ask you, Jeremy.
Dave: Is big data overhyped, in your opinion?

Jeremy: Some people, Dave, have big data problems. Google has big data problems. Bitly has big data problems.

Dave: You need to come closer to the mic.

Jeremy: Most people, I think, have medium-sized data problems. They have the kind of data problems where, if you use a good algorithm on your laptop, you'll get great results. So if you've got less than a few gigabytes of data, you probably don't need to be sending it out to clusters. You can probably use a fairly simple streaming algorithm using open-source, free tools. And the great thing about doing it that way is that rather than spending hours waiting for the algorithm to finish, you can use much more powerful algorithms, try many more algorithms, and spend a lot more time crunching the data and getting better answers. So, Dave, I compete in a lot of these competitions, right? And I've had some good results. The times I've had good results are when I turn the big data problem into a little data problem and find a way to run lots of algorithms really quickly and try lots of things.

Dave: So you're saying, in that instance, it's about speed and getting the most efficient application of those algorithms possible?

Jeremy: It's about speed of development. This is all about prototyping, okay? For example, in the Netflix Prize, the guys that won had 300 submodels. You can't build 300 submodels on a billion rows of data, right? So what they did was build those things on smaller amounts of data, and then at the end, at the very last phase, it becomes a big data problem. That's when Netflix goes and says, how do we do this in real time for a billion customers every day? That's the big data problem. But the prototyping, the algorithm development, is not about big data tools, in my opinion.

Dave: Right. Now, of course, again, as we were talking offline, you indicated that a lot of people are using Hadoop who don't necessarily have to use Hadoop.
Dave: Why do you think they do that? Just because it's such a hot area and they want to be part of the next big thing?

Jeremy: You're going to get me saying controversial things now.

Dave: Oh, come on, you might as well. Why do they use Hadoop? Do you think that's overkill? I mean, it's important. The reason I'm asking is that we have a lot of practitioners in the audience who are trying to figure out, should I be applying Hadoop to my business, and where should I be applying it? So help the audience understand where they should apply it and where they shouldn't.

Jeremy: Hadoop is big because Java is a terrible platform to be working on these kinds of solutions. Because it's so damn hard to do in Java, you've got to have incredibly clever tools to do it more easily, or even to do it at all. Hadoop works on an incredibly difficult engineering problem, which is making Java work on these kinds of problems. The people who work on the hardest engineering problems and get traction on them are the coolest people in any industry. That's why Hadoop is cool. So for some people, it's fantastic. For a lot of people, it's the wrong tool for the job.

Dave: So it's the pain reliever for Java. Now, Jonathan, having said all that, in the medical and particularly the research communities, big data has a lot of applications, doesn't it? What are you seeing there? Can you talk about that a little bit?

Jonathan: I would probably defer on a question like that, because it's not my area of expertise. There clearly is a ton of data in the medical field; there may be more data in the medical field than anywhere else. The question really is how well we use that data today to allow us to do what medical care really should do, which is take care of patients. I don't think we do a particularly good job of that, and that is something we're trying to begin to solve with a prize like this.

Dave: Right, right. So how long are you guys here for? Are you going to be here tomorrow as well?
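[Editor's note: the competition mechanics Jeremy described earlier, an outcomes file held by the organizers, a submitted column of predictions, and a hidden public/private leaderboard split, can be sketched in a few lines of standard-library Python. The 50/50 split and the RMSE metric here are illustrative assumptions, not the actual Heritage Health Prize rules, which had not been published at the time of this interview.]

```python
import random

random.seed(42)

# Organizers hold the true outcomes for the test rows; competitors never
# see these, only the matching file of inputs without outcomes.
true_outcomes = [random.random() for _ in range(1000)]

# Hidden split: half the rows score the public leaderboard shown during
# the contest, half are reserved for the final private ranking.
indices = list(range(1000))
random.shuffle(indices)
public_idx, private_idx = set(indices[:500]), set(indices[500:])

def score(submission, idx_set):
    """Root mean squared error over one half of the test rows."""
    errs = [(submission[i] - true_outcomes[i]) ** 2 for i in idx_set]
    return (sum(errs) / len(errs)) ** 0.5

# A competitor's submission is just a column of predicted outcomes,
# one number per test row. Here, a naive constant baseline.
submission = [0.5] * 1000

public_score = score(submission, public_idx)    # visible all contest long
private_score = score(submission, private_idx)  # revealed at the end
print(round(public_score, 3), round(private_score, 3))
```

The held-back private half is what lets the organizers "double-check that the leaderboard result can actually be applied to a whole new, fresh piece of data": a model tuned against the public half alone will show it in a worse private score.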