theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible.
Welcome back to Hadoop Summit on theCUBE. I'm Jeff Kelly with Wikibon. My next guest is Bill Yetman, VP of Commerce, Data and Analytics at Ancestry.com. Bill, welcome to theCUBE.
Hi, nice to be here.
Well, thanks for coming on. We were talking earlier, before we went on the air, about your title. You've got three very important words in there, not least of which is VP, but Commerce, Data and Analytics. I think most of our viewers will be familiar with Ancestry.com, but tell us a little bit about what your role there is.
For my teams and my part of the organization, on the commerce side, I own the services that bring in all the money and keep track of the subscriptions, everything there. Data involves the data warehouse and our big data efforts. We're mushing those together and driving them together. Also under Data, we have the DNA project. I have the teams that do the ethnicity and the DNA matching that we put on the site. We show genetic cousins, so if we're related and we find out that we're fourth cousins, it means there's a common ancestor 150 to 300 years ago and there's a high probability, in the high 90% range, that we are related somewhere back there. And then for analytics, because I have the data warehouse and the data, I'm trying to feed the analytics to get actionable insights for the business.
Well, there are so many cool use cases we could talk about. Why don't we start by digging into the role data plays at your organization? I would imagine it's pretty critical. I mean, it's core to your business.
It's critical.
If you've never been on the site or you're not sure how the site works, you come to the site and start by building a family tree. Once you can get back to your grandparents and we can connect you into the records, we can start showing the shaky leaves, the little hints that say what we know about this person from, say, the 1940 census or the 1930 census. We have birth, marriage, and death records, but also user-contributed content. Other people have put up their family trees, and we can find matches that are related and show those to you as well. So you may find a photo of a great-grandfather that somebody else has put up, one that you've never seen before. And so it's a great way to do crowdsourcing. It's just really different.
So, I mean, data really has to be a core competency at your organization.
Yeah, it does.
So talk a little bit about how you go about doing that internally. There's a lot of interest, I think, from organizations that might not be quite as data-centric, about how you build up a competency in analytics and big data. What's the approach? It might be a hard question to answer, actually, for Ancestry.com. Because it's so core, I would imagine it permeates all corners of your organization. But to the extent that you can, how do you go about cultivating that talent internally, and the culture of data?
One is we're hiring people and we're keeping the talent internal. We're not outsourcing it to other areas. We keep it within the organization. We're trying to grow that expertise and get better at it. That's one key thing. And then there's just the amount of data. At times, it can be overwhelming. And so how do you find those nuggets that you want to start with, and what are those insights that you can act on?
So if you sign up for a 14-day free trial, we found that if you do what we call 25 discoveries in the first 10 days, then you're likely to be more successful and stay with the service, right? So how can we get you to find more discoveries? And how can we lead you down that path? If you're struggling and you're stuck at 15, how do we get those last 10 to you and help you discover your past? So those are the kind of things that we're trying to do to help people. And if I can help the customers, if I focus on doing something that works well for the customers, the retention and the conversion and the revenue sort of take care of themselves.
Put the customers first, for sure.
Yeah, put the customers first.
Let's talk about the types of data you're dealing with, because I've actually poked around Ancestry.com. You've got census data, you've got images, as you mentioned, all sorts of data. This is not your structured data world coming out of a CRM application or a financial application. Give us some examples of the kind of data sources that you've got to wrangle.
Well, you know, the National Archives. We're one of the few companies that can actually work closely with the National Archives and get some of the documents and things from the National Archives, get them scanned in, get them indexed, and get them on the site. So if you look at just the record data, it's just fascinating. Have you ever looked at a census document?
I've seen scans. I haven't really looked too closely.
It's really interesting because every census writer had different handwriting, and for some of them, their handwriting was really, really bad. So you can see that there are problems with the transcriptions, and how do you correct it and get it right?
And we found that one of the ways you do it is you allow the customers who know to say, that is my great-grandfather, but his name is not Wilbur, it's William, right? And allow them to go in and change it. You keep the original, but you show the alternatives and allow people to search by the alternatives and have those show up. So there's a lot of little things like that. And if you just think about all that census data, how can you stitch it together? Because some people were in the 1910, 1930, and 1940 census, how can you tie them all together and follow those families? We've done some very interesting data visualization with some of this. The data science team looked at immigration patterns from about 1700 to about 1950 and watched the lines go from different parts of Europe and different parts of Asia into the US. And you watch it over a period of time. It's a great visualization. Another thing that's interesting to look at is just trees and tree data. We have some people that have some really huge trees, hundreds of thousands of nodes, each an individual person that they put up in their tree. I don't know how they had the time to go find all those people, but they have them. And you can look at them visually, and sometimes they're just a rat's nest and other times they're laid out very nice and neat. And so just some of those quick visualizations can be really cool.
Very nice. So tell us a little bit about the technologies you're using to handle this data, because you mentioned you're bringing your data warehouse and some of your big data initiatives together. And there's a lot of talk at this conference about whether Hadoop is going to replace the data warehouse or is complementary. What's the impact going to be on the data warehouse vendor community? What's your approach?
We've had a traditional data warehouse for years, and so we would be side by side.
But we are moving to an MPP solution. We use what used to be ParAccel, now Matrix from Actian, for the data warehouse. Hadoop is starting to do a lot of our ETL. Almost all of our data is going into Hadoop, and we're starting to do ETL there. Then, how do we aggregate it into Matrix? On top of that we use Tableau to expose it. I'm really trying to get the organization to self-serve. And how can we very quickly get new data, new insights, new behaviors from the customers into the data warehouse and expose it back out to the organization, to both product and marketing, to build a better product, build a better experience? I hope that makes sense.
Yeah, absolutely. I wish we had more time. We're about to wrap up, so I just want to give you the last word. What are your thoughts on this show? This is my third show, and it's grown significantly over three years, and it's 3,200 plus people from 1,000 companies. What's your take on the vibe here?
I think one of the things that I really like about coming to this show is seeing what people are doing with Hadoop. And I really like seeing the companies that stand up and say, this is what we did right. This is where we had issues. This is where we changed and grew. That's really what I like about this show, and I'm surprised at how big it is this year. I missed last year, so I was last here in 2012, and I think it was a little over 2,200 then and it's 3,500 today. I'm surprised at the growth. And just the people that you walk around and talk to: a lot of people with great experience who have done a lot of things, and then people that are just starting out. You see all types, and the problems they're going after. Like us, we're using big data to understand our customer behavior but also for DNA. And so Hadoop is so flexible and allows you to solve so many problems. It's amazing.
Well, fantastic. Bill Yetman from Ancestry.com, thanks so much for coming on theCUBE. I hope it wasn't too painful. We appreciate it.
We love practitioners telling their stories to our audience of other practitioners. So I really appreciate it. It was a ton of fun. Likewise. Thank you. So stay tuned. We'll be right back here live at Hadoop Summit in San Jose.