Our fourth contestant is Sam Ventura, whose talk is titled "Solving the Identity Crisis."

One of these two people is widely known for being one of the best ever to do what he does. He's won numerous awards over the course of his career, and he's produced some of the best statistical output in history. The other was in a movie called Space Jam. Now, if you're confused, here's why: both of these men are named Michael Jordan. On the right, we have Michael Jordan, the world-renowned professor of statistics at UC Berkeley. And on the left, we have the Michael Jordan most of you are probably more familiar with: a Hall of Fame basketball player who won six NBA championships and now spends most of his time making awkward underwear commercials.

Now, I want you to imagine for a second that you want to learn more about Michael Jordan the statistics professor. So, like any bright human of the 21st century, you type "Michael Jordan statistics" into Google. And what happens? You're barraged with 43 million hits about how many points Michael Jordan the basketball player scored in the 1997 NBA playoffs.

This is the problem I'm trying to solve with my thesis. We have these text records, for example these Google searches, and often these text records are ambiguous. Google doesn't really know which Michael Jordan's statistics I'm looking for, so I'm developing new statistical methodology to disambiguate these text records. Imagine there's a giant database of every paper that's ever been written. Am I the first Sam Ventura ever to write a paper? No, of course not. But can the algorithms I developed in my thesis determine which of these papers, which of these text records, belong to me and which belong to the other Sam Venturas? That's the goal. And here's how I approached it.

Existing approaches to this problem also use statistics, but they make one big mistake: they throw away a lot of information that could be useful in determining which records should be linked to each other. In my thesis, rather than throw this information away, I propose a framework for taking it into account when determining which records should match. And the funny thing is, I'm not doing anything all that complicated. I'm using simple statistical concepts, such as taking the mean, the median, and the skew of a distribution, and combining them with complex statistical models that learn which characteristics of these simple statistics are associated with matching records and which with non-matching records. In essence, I'm teaching the computer how to tell the difference between these two Michael Jordans.

Now, I know what you're thinking. You're thinking, "Sam, that's not that impressive at all. I can tell these two guys apart in two seconds." And to that I say: that's great. But could you do it over and over and over and over again for billions of pairs like these? That's what the framework I propose in my thesis does. I combine simple and complex statistical methods to reduce the amount of linkage error in massive data sets. And the best part is, I'm doing it all for you, so that each of you gets proper attribution for your great work. Thank you.
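The approach the talk describes, simple summary statistics of record comparisons fed into a learned model, can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not Ventura's actual pipeline: the records, field names, and the choice of a random forest as the "complex statistical model" are all hypothetical stand-ins for whatever his thesis actually uses.

```python
# Minimal record-linkage sketch: compare pairs of author/paper records with
# simple string similarities, summarize them with a simple statistic (the
# mean; the talk also mentions the median and skew of such distributions),
# and let a supervised model learn which patterns indicate a match.
from difflib import SequenceMatcher
from statistics import mean
from sklearn.ensemble import RandomForestClassifier

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(r1: dict, r2: dict) -> list:
    """Field-by-field similarities plus their mean as a summary statistic."""
    sims = [similarity(r1[f], r2[f]) for f in ("name", "affiliation", "topic")]
    return sims + [mean(sims)]

# Hypothetical labeled training pairs: 1 = same person, 0 = different people.
training = [
    ({"name": "S. Ventura", "affiliation": "CMU", "topic": "record linkage"},
     {"name": "Sam Ventura", "affiliation": "Carnegie Mellon University",
      "topic": "record linkage"}, 1),
    ({"name": "Michael I. Jordan", "affiliation": "UC Berkeley",
      "topic": "statistics"},
     {"name": "M. Jordan", "affiliation": "Berkeley",
      "topic": "machine learning"}, 1),
    ({"name": "Michael Jordan", "affiliation": "Chicago Bulls",
      "topic": "basketball"},
     {"name": "Michael I. Jordan", "affiliation": "UC Berkeley",
      "topic": "statistics"}, 0),
    ({"name": "Sam Ventura", "affiliation": "CMU", "topic": "hockey analytics"},
     {"name": "S. Ventura", "affiliation": "Acme Corp", "topic": "marketing"}, 0),
]
X = [pair_features(r1, r2) for r1, r2, _ in training]
y = [label for _, _, label in training]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score a new, ambiguous pair: do these records refer to the same person?
candidate = pair_features(
    {"name": "M. I. Jordan", "affiliation": "UC Berkeley", "topic": "statistics"},
    {"name": "Michael Jordan", "affiliation": "Berkeley", "topic": "statistics"},
)
print(model.predict_proba([candidate]))  # [P(non-match), P(match)]
```

The point of the design is the division of labor the talk describes: the features stay deliberately simple, and any supervised classifier could stand in for the random forest; the model's job is only to learn which combinations of those simple statistics separate matching pairs from non-matching ones at scale.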