Vaclav Petricek - Hadoop Summit 2013 - theCUBE - #HadoopSummit








Published on Jun 27, 2013

Jeff Kelly sits down with Vaclav Petricek inside theCUBE at Hadoop Summit 2013. Vaclav Petricek is the Principal Data Specialist with eHarmony.

Petricek runs machine learning applications at eHarmony to decide who should be introduced to whom, and when. For that, they use Hadoop and logical machine learning. "eHarmony is a bit different than your typical dating site," brags Petricek. Typical dating sites are search-based, with results generated by certain search criteria. The founder of eHarmony is Neil Clark Warren, a marriage counselor. After years of counseling couples in failing marriages, he wanted to help people meet not only those they would be attracted to, but also those they are compatible with.

As for the underlying technology that makes this possible, Petricek explains: "To match people effectively, you need to solve three separate problems. The first one is long-term compatibility, then there's the affinity matches (based on age and location), and finally, distribution (who to introduce to whom and when)."

An affinity for Hadoop

Hadoop and large-scale machine learning are used for the affinity part. To predict whether or not two people would be interested in talking to each other, eHarmony uses the historical data generated by its 10 years of operations. As for the data itself, Petricek clarifies: "Over the years the questionnaires have evolved, but certain questions have survived. It used to be 500 questions and now it's down to 150, which is a lot of data, enough to 'know' someone. That's how you can still make recommendations to people who joined the site recently." The questionnaire is not the only tool. eHarmony also collects behavioral data: when users log in, how often, and what kinds of devices they use.

Jeff Kelly next wanted to know how eHarmony addresses the problem of people who do not answer the 150 questions truthfully. Petricek replied: "You cannot force someone to answer truthfully, but we offer incentives to do so, in order to get the right matches. It's a science in itself to design the questions in such a way as to get the underlying psychological traits, and not what the person would like to be."

Talking about the technology itself, Petricek explained: "We store all of our data in-house, on a Hadoop cluster, in HDFS, and on top of that we run Hive, which provides the SQL interface, and then we do the machine learning modeling. We use a lot of Vowpal Wabbit, an open-source large-scale machine learning system written by John Langford, that can scale on the Hadoop cluster. And lastly, we use some genetic algorithms."
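
To make the Vowpal Wabbit step more concrete, here is a minimal sketch of how one historical introduction might be written out in Vowpal Wabbit's plain-text input format (a label followed by `|namespace feature:value` groups). The interview does not describe eHarmony's actual features or labels, so everything here is an assumption for illustration: the function name `to_vw_line`, the namespaces `a`, `b`, and `pair`, the feature names, and the choice of +1/-1 to mean "communicated" or "did not communicate".

```python
def to_vw_line(label, user_a, user_b, pair):
    """Format one introduction as a Vowpal Wabbit text example.

    label: +1 if the two people communicated, -1 otherwise
           (hypothetical encoding, not taken from the interview).
    user_a, user_b, pair: dicts mapping feature name -> numeric value
           (hypothetical features).
    """
    def namespace(name, features):
        # Sorting keeps the output deterministic; VW does not require it.
        body = " ".join(f"{key}:{value}" for key, value in sorted(features.items()))
        return f"|{name} {body}"

    parts = [
        str(label),
        namespace("a", user_a),
        namespace("b", user_b),
        namespace("pair", pair),
    ]
    return " ".join(parts)


example = to_vw_line(
    1,
    {"age": 34, "logins_per_week": 5},
    {"age": 31, "logins_per_week": 2},
    {"distance_km": 12.5},
)
print(example)
# prints: 1 |a age:34 logins_per_week:5 |b age:31 logins_per_week:2 |pair distance_km:12.5
```

A file of such lines could then be passed to the `vw` command-line tool, for instance with `--loss_function logistic` for a yes/no prediction. Whether eHarmony trains its models this way is not stated in the interview.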


