Hi everybody, welcome back to theCUBE here at Hadoop Summit in San Jose. I'm Jeff Kelly with wikibon.org. We're here again on SiliconANGLE's theCUBE. My co-hosts, Dave Vellante and John Furrier, just stepped out for a minute, and I'm going to have a conversation now with our next guest. We've got a big data practitioner on. His name is Vaclav Petricek, a principal data scientist at eHarmony. So Vaclav, welcome, and apologies if I mangled your last name a little bit.

Thank you for having me.

Absolutely, thanks for coming on, really appreciate it. As longtime watchers of theCUBE will know, we love to get practitioners on to talk about what they're really doing with big data. So why don't we start with: tell us a little bit about yourself, your role at eHarmony as a data scientist.

So I run machine learning applications at eHarmony, trying to decide who we should introduce to whom, and when. For that we use Hadoop and large-scale machine learning. And eHarmony is a bit different from your typical dating site. Typical dating sites are search based: you specify your criteria and you get back results that match your criteria. eHarmony is a bit different. We were founded by Dr. Neil Clark Warren, a marriage counselor and clinical psychologist in Pasadena. He was working with married couples, and quite a few of the couples who came to him were in failing marriages. So he thought, wouldn't it be nice if he could help people find and meet a person they're not only attracted to but also compatible with, so that maybe 20 years later they would still enjoy spending time together and doing things.

Right, so talk a little bit about the underlying data infrastructure, some of the tools you use, and some of the techniques. Because, as you mentioned, other sites are more search based.
The user basically says, here's what I'm looking for, and then essentially a search engine spits back some results. But you actually write algorithms that try to determine, based on a number of characteristics I imagine, who is the best fit — who in 20 years you'll still be happy to be waking up next to.

Exactly.

So tell us a little bit about how you go about actually developing those algorithms, what some of those characteristics are, and how you even get the data those algorithms are based on.

Yeah, so to match people effectively we need to solve three separate problems. The first one is compatibility for the long term, which I just spoke about. Then, not everyone who is compatible is actually interested in each other. People who are psychologically compatible might have too big an age difference, might live on the other side of the planet, or there might be some other deal breakers. So that's what we call affinity matching, where we try to match people who will actually be mutually interested in each other. Not only one way — the other person needs to be interested too, because to get married you need the consent of both, right? And then finally, the third problem we're solving is the distribution problem: who to introduce to whom, and when.

So for the first problem, compatibility, we have a relatively large sample as measured by psychological research standards: thousands of married couples, and we know how happy they are, what kind of personalities they have, et cetera. But this is still not really web-scale data, so the techniques we use there to build the models are slightly different. Where we use Hadoop and large-scale machine learning is the affinity part. To predict whether two people would be interested in talking to each other, we have historical data, because eHarmony has been around for more than 10 years.
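To make the affinity idea concrete: predicting mutual interest from historical match outcomes is a supervised learning problem. Here is a minimal sketch in that spirit — the features (age gap, distance, shared interests) and all the data are invented for illustration, and this tiny logistic regression stands in for the large-scale models described in the interview, which use thousands of attributes.

```python
import math

# Toy historical match records: (age_gap_years, distance_km, shared_interests, communicated).
# All features and values are made up for illustration -- not eHarmony's real data or model.
HISTORY = [
    (1, 10, 5, 1), (2, 5, 4, 1), (15, 300, 1, 0), (8, 50, 3, 1),
    (20, 800, 0, 0), (3, 20, 2, 1), (12, 400, 1, 0), (5, 15, 4, 1),
    (18, 600, 2, 0), (2, 8, 5, 1), (10, 250, 0, 0), (7, 30, 3, 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def featurize(age_gap, dist, shared):
    # Bias term plus crudely scaled features.
    return [1.0, age_gap / 20.0, dist / 800.0, shared / 5.0]

def train(history, epochs=2000, lr=0.05):
    """Fit logistic regression by stochastic gradient descent; returns weights."""
    w = [0.0, 0.0, 0.0, 0.0]
    for _ in range(epochs):
        for age_gap, dist, shared, label in history:
            x = featurize(age_gap, dist, shared)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            err = label - p
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

def affinity(w, age_gap, dist, shared):
    """Predicted probability that a matched pair would communicate."""
    x = featurize(age_gap, dist, shared)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

w = train(HISTORY)
close_pair = affinity(w, 2, 10, 4)      # small age gap, nearby, shared interests
distant_pair = affinity(w, 18, 700, 0)  # large gap, far apart, nothing shared
```

The model learns from past outcomes which feature combinations predicted communication, then scores new candidate pairs the same way.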
We have all the information about which people we matched, whether they communicated, whether they were interested in each other, et cetera. So that's where we employ Hadoop and large-scale machine learning. And then finally, for the distribution step, we have some in-house solvers that handle the constraint optimization of who to introduce to whom and when, considering that we don't want to overload people by dumping thousands of results on them, but we do want to give them sufficient choice.

So the first part is more around traditional research, where you're trying to understand things based on the work psychologists have done over the years, the more traditional literature.

Definitely — informed by the psychological and marital satisfaction research.

So let's talk about that middle portion where you're using Hadoop. You've got all this historical data; you mentioned you've been around for about 10 years. What kind of data do you have? I imagine the questions you ask people when they join eHarmony today are very different than they were 10 years ago. Tell us a little bit about the type of data and how you get it. Do you ask people? Is it a questionnaire? How does it actually work?

So yeah, of course our questionnaires have evolved over time, but there are some questions that have survived this whole process because they get at the same underlying psychological traits, so some of these haven't changed very much. We get the information from users through the relationship questionnaire they fill in when they join the site, which right now is about 150 questions. It used to be about 500, and that's a lot of data. We get to know people immediately as they come in. Today after the talk I was getting this question: how do you solve the cold start problem? How do you make recommendations for people who are just new to the site?
We actually know a lot about people, because they tell us. They tell us how far they're willing to travel. We know their personalities. We know all their attributes. But this is not everything we use. We also collect behavioral data. People are using our site; they're communicating with some people and not with others; they're engaging in different types of communication; they're logging in every day, every other day, several times per day; they use different mobile devices. We collect all this information, and it's then available for the machine learning models to make decisions on who is right for whom.

All right, this is really interesting. A couple of things came to mind as you were speaking. You're asking a lot of questions, about 150. Now, you're basing your algorithms and some of the recommendations you make on how people answer those questions, but we know some people won't always answer honestly — maybe you don't even realize you're not answering honestly. Do you take that into account when you're running the algorithms? How do you go about filtering out the things people think they believe but maybe don't really?

Yes, this is a very interesting question. Of course there is no way you can force people to tell the truth, right? They can choose any option. However, we set up the whole process in such a way that it's obvious to users that the incentives are to answer everything truthfully, so that they get the right matches. Because if you pretend to be a different personality type, you will just be matched with the wrong person, right? So the incentives are definitely set up so that we elicit people's truthful preferences and their real behavior.
And yeah, it's a science in itself to design these questionnaires in such a way that you actually get at the underlying psychological traits and not at what the person would like to be, right? So that's one thing. But for example, for preferences, we have some questions that correlate with self-reported user attractiveness — how people perceive themselves — and you would think that might be very unreliable. But when you look at the data, you actually see that people who report a certain level of attractiveness do well with other people who answered in a similar way, right? So even questions where there is no way to verify the answer are useful for predicting who will be interested in talking to whom.

So it's not so much whether the answers are truthful — you can still take that data, derive certain things from it, and connect people who are likely to connect.

Yes, so even data that might be slightly skewed can be predictive. But on the other hand, we don't see this problem very much. I know there were studies on other dating sites where men would inflate their height and income and other attributes. We don't see it that much, because answering our relationship questionnaire takes time. Everyone on the site is there for the long term, for a serious relationship. So once you go through this big process and you go on your first date, you can be found out very quickly, and you can be reported by our users for misrepresenting yourself. So it's not the most efficient place to lie.

Right, well, I guess my point was that some people have perceptions of themselves that aren't always accurate, but you can still use that data to make a prediction. So you also mentioned data around how people are using your site. How are they navigating through eHarmony.com? Are they using a mobile device, a tablet, or a phone? How do you use that data?
How does that come into the equation of matching people up?

Yeah, so we use this data in a very similar way to any other features and attributes we have about our users. You may tell me explicitly where you live, and then I can also observe that you're talking to this type of person. In the end, all this information is just features for the machine learning algorithm, which you can think of as a black box that takes all these inputs and produces a prediction, right? So it's not too different. There is some thought that goes into engineering the features. You need to think about what kind of behavior might be predictive. If you want to predict communication between people, typically some kind of past communication pattern will be very predictive. So there is some feature engineering that goes into it, which might be slightly different from how you would represent the place where a person lives. But otherwise, it's just another feature. And our models use thousands of attributes, which can get expanded into higher-level features derived from them. All of them combined contribute to the predictions, and each contributes incrementally to our ability to predict whether two people are right for each other.

Really fascinating. One thing you mentioned to me before we went on is how some people tend to think that what dating sites do seems very similar to the recommendation engines you might see on Amazon. But as you pointed out, Amazon recommends a book you might like, but the book doesn't have to like you back — that's not a challenge Amazon has. So it sounds like that makes this a more complicated and difficult problem to crack.

Yeah, absolutely. It's an additional restriction. I mean, if you're recommending movies or books, you're very happy when you have a best seller, right?
Because you fill your warehouse with this best seller, you sell it to everybody, you have economies of scale, everything is simplified, and you're super happy, right? In our case, we just cannot marry one person to everybody, right? So we need to make much more diverse recommendations. That's an added challenge, and all our models take into account that the other person needs to like you back as well. We're optimizing for the number of marriages — the number of good marriages — that we make.

So let's turn to some of the technology you're using. You mentioned Hadoop. You're on the data science side, so I imagine you're probably working a little higher up the stack, using certain analytic tools to manipulate the data and test your algorithms, things like that. What technologies do you use on a daily basis to help you craft these algorithms and really improve what eHarmony does?

Yeah, when people think about machine learning and data science, they think about these algorithms and how you tweak the little features and knobs on the algorithm to perform better, right? It turns out that most of the work is actually preparing the data, and typically you get a much bigger win from adding a feature you didn't have in your data set than from switching your algorithm for a fancier one, or from ensembling the algorithm with another technique, right? So this is generally where you spend a lot of your time. We store all our data in an in-house Hadoop cluster, in HDFS, which has been very useful because the cleaned data is present and available to everyone. If anyone has a question, they can dive into it. If we want to test some new theory, we can go and do it very quickly. So it's stored in HDFS, and on top of that we run Hive.
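The "needs to like you back" constraint described above is what makes this a reciprocal recommendation problem: a one-directional preference score isn't enough. One common way to sketch it — purely an illustration, not eHarmony's actual formula — is to combine the two directional interest probabilities with a harmonic mean, which keeps the pair's score low whenever either direction is low:

```python
def mutual_score(p_a_likes_b, p_b_likes_a):
    """Harmonic mean of the two directional interest probabilities.

    Unlike the arithmetic mean, the harmonic mean stays low when either
    direction is low, so strong one-sided interest can't dominate the ranking.
    """
    if p_a_likes_b == 0 or p_b_likes_a == 0:
        return 0.0
    return 2 * p_a_likes_b * p_b_likes_a / (p_a_likes_b + p_b_likes_a)

# Strong one-sided interest (0.9 one way, 0.1 the other) scores lower
# than moderate mutual interest (0.6 both ways).
one_sided = mutual_score(0.9, 0.1)  # -> 0.18
mutual = mutual_score(0.6, 0.6)     # -> 0.6
```

This is the kind of scoring that, at scale, would feed the distribution solvers mentioned earlier, which decide who actually gets introduced to whom subject to constraints like not overloading any one person.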
Hive provides a SQL interface, so anyone with knowledge of SQL can test their theory. We then use Hive to join the disparate data sets coming from production and from other systems, join them together, and then do the machine learning modeling. And then comes the interesting part that's typically talked about, which is building the actual model: representing the features, building the model. We use Vowpal Wabbit a lot, which is an open source, large-scale, supervised online machine learning tool written by John Langford, who is now at Microsoft Research in New York. It's a super fast machine learning tool, and it can even scale out on the Hadoop cluster using the AllReduce parallelization implemented there. But it's already super fast — it could run on your laptop on the data sets of most of the companies that are probably here. So that's a very cool tool. Otherwise, we use R, the statistical package, for visualization; it has a decent implementation of gradient boosted decision trees, and there is another interesting implementation of regularized greedy forests. All these algorithms bring a different view of the data, and then you can use the features you discover, or ensemble them all together to get better predictions. And finally, we've had some luck looking for new features using genetic algorithms. That actually provides a surprising lift, even on top of gradient boosted trees.

Wow, so really you're open to whatever works, and you're really looking at being flexible. So I guess we've got time for just one more question. We have a lot of people out there watching who want to understand how to better work with data. What are some of the attributes you would look for in a data scientist if you were building your own team?
It certainly sounds like one of them would be: use whatever tool is useful, really. But what are some of the attributes you look for? What makes a good data scientist?

Yeah, I like Josh Wills's definition: a data scientist is someone who is better at programming than any statistician, and better at statistics than any developer, right? So there is this whole spectrum, from someone who really knows the systems and can write code that scales, to, at the other end of the spectrum, someone who does basic research on statistical methods. And there is actually space for a lot of people with different skills. I find that in industry, it's generally easier to teach developers who are very good at programming what machine learning is, how to use it, and how to build things, than it is to teach a statistician who is very interested in the theory behind statistics how to develop practical recommender systems. So I think for a data scientist in industry, it definitely helps to have a very strong development background.

All right, great. Well, some great advice. So Vaclav, thank you so much for coming on. That's Vaclav Petricek, principal data scientist at eHarmony. We just learned a lot about how eHarmony helps match people up — and I just want to assure my wife that I'm happily married; that's not why we did this segment. That was my lame attempt at humor. So anyway, thank you so much for coming on, we really appreciate it. We'll be right back; we've got more guests. We're going to be going all day here at Hadoop Summit 2013 in San Jose. You're watching theCUBE.