Live from Boston, Massachusetts, extracting the signal from the noise. It's theCUBE, covering HP Big Data Conference 2015, brought to you by HP Software. Now, your hosts, John Furrier and Dave Vellante. Okay, welcome back everyone. We are live in Boston, Massachusetts for HP Big Data Conference. Special presentation of SiliconANGLE's theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, SiliconANGLE. I'm joined by my co-host, Dave Vellante with wikibon.com. Our next guest is Bill Theisinger, VP of Engineering at yellowpages.com. Welcome to theCUBE. Thank you. So, VP of Engineering, yellowpages.com, must have a ton of data. I mean, huge big data. Give us a little bit of background on how big the data is. What's the engineering culture look like? Sure, so, you know, yellowpages.com, yp.com, is a local marketing solutions provider. We're based in Los Angeles. So, right now, you know, we are helping small businesses and consumers kind of get together. We have about half a million small businesses as advertisers with YP, across about 20 million business listings in 4,600 categories. So, that does generate a good amount of data. We capture about three billion events a day and process them in our infrastructure. All right, so, take us through a day in the life, or a month or a week or a life cycle. A lot of ingestion. Yes. A lot of data you're going through. Take us through some of the core problem areas, opportunities where you guys are kicking butt. What's going on? Sure, well, we have a lot of integration, of course. We integrate with about 50 different data sources: yp.com, all the YP native apps, and our partner networks. We partner with Yahoo and Bing and others. We capture a lot of that data. So, clicks, impressions, searches, traffic, and clickstream data. We also capture data from our native internal applications. So, business applications, back office systems, things of that nature.
And we have to integrate all that data so we can tell a complete story about our advertisers, about our consumers, and also about the behavior internally within YP. So, you have data pipes into all those networks. So, for instance, Yahoo and Bing, you're connected. Are they co-locating data dumps over there, or is it direct feeds coming back and forth? No, we have a data collection framework that actually will capture searches and clicks and impressions from their systems, right, and their platforms, and we'll capture them in our network and process them. Got it, so you're getting all that clickstream data. Okay, so what's the biggest challenge that you have? If you had to boil it down to the top three, from engineering to the business, how do you tie it all together? The biggest challenge we have right now is we have roughly four to five petabytes of data that are sitting there. We're processing it, we do a really good job of processing it, we do a really good job of collecting it. We're doing a very poor job of being able to get that information to analysts and product managers and people within the company. So, we are continually looking for tools that help us do that. Yeah, because you've got a huge ingestion. Yes. So, you're storing the big piles. We got the data, we've mastered that, we've gotten pretty tight SLAs, we've mastered that, we have a pretty sound infrastructure, we've mastered that. Now we really need to master the information piece of this. How do we get that data to the analysts, to the product owner in the company, to the data scientist? So, that's the next level, that's the next level of innovation you guys are going to. Yes. Working on data, low latency, search. Yep. All the hard stuff. All the hard stuff. It's really hard. It is hard. Explain why it's hard. A lot of folks go, oh, I've got the pile of data and all the things prepared. Why is it so hard? Nothing looks the same, right?
We always say unstructured, and you can capture unstructured data, but at some point there needs to be structured data. Otherwise you can't really analyze it and report on it. So, that's a challenge in and of itself. How do you do that, right? We have a lot of systems we integrate that are unstructured initially, and then we have to turn them into structured data. A good example is something like salesforce.com. We use Salesforce in our company and we integrate Salesforce through HBase using their APIs to load data, right? But that unstructured HBase data needs to turn into structured data at some point so we can provide reporting to people. And that's a challenge. When you take that one unique use case and then spread it across 70 different use cases, all of a sudden you're really, you know, trying to deal with things at scale. And there's a lot of variety in the environment. And HBase itself is a moving train, because you've got to write your own libraries, a lot of custom stuff going on. Yes, HBase is. But once you master it, it's awesome. I'll never say master, never say master. So, how long have you been with the company? I've been with YP since 2010. Prior to that, I owned a consulting company, and YP was a customer of mine. And I ended up taking a full-time gig there so I could get a lot of hands-on experience with, you know, big data and the platforms that are there. So I've been there about five years now. Okay, so you obviously know a lot about the history of the company as well. I mean, here's a long-time company that has had to totally transform, I mean, from sort of being a dead-tree company to an online marketing powerhouse. It has absolutely gone down that path. Can you talk about that transformation culturally? You guys must talk about it all the time. I mean, it's, you know, I mean, Yellow Pages started in the 1800s. Yeah, that's how I was dispatched a minute ago. It's been over a hundred years. You know, it's phone books.
We just recently spun off the... You know the story, right? They ran out of white paper, right? And they put in yellow paper. That's how the Yellow Pages started. No more paper. And so my boss, the CTO, Darren Clark, has done a really good job of building a very strong core technology team in Glendale. There's about 350 engineers there. A lot of them have really solid backgrounds from Yahoo and other strong tech companies. We've built our own internal search systems. We've built our own data platforms. So it's a very good core engineering team that's there. So now the focus has been, how do we transform what looks like a print business into what really is a digital business? And we've been going through that exercise for the last three years. And it's a challenge and an opportunity. It's been a lot of fun. And so when you have little successes, that breeds culture. I think success breeds culture. So when you build systems and tools and you can see them actually impacting the marketplace and helping consumers and helping businesses, then you start to get something out of that. And that's really all the engineers want. They want to see and build something that people use and find useful and that helps them, right? And a lot of your customers and partners are also large scale, Yahoo, Bing. I mean, they're not screwing around either. We have massive petabytes, tons of data. They have a lot of data. So I used to work at Yahoo back in the day before I left there in 2005. And then went back to working at Idealab, Bill Gross's company, and was there for about a year before I started my own consulting business. And then went to YP, which was at the time AT&T Interactive. So we were owned 100% by AT&T. And in 2012, AT&T divested 53% to Cerberus Capital Management out of New York. And AT&T still owns 47% of us. So we serve two masters.
So, when thinking about that digital transformation, I mean, the conversation must have started with, all right, what are our assets, you know, and how do we transform those into a digital world? So you obviously have the data. Yep, we have the data. Everything and everybody, businesses, people. Right. And then you've got advertisers, and we have relationships with them. Absolutely. You've always been about putting those together, but how did you sort of transform that model into the digital world? What does it look like today and how have you enhanced it to take advantage of digital? Yeah, it's kind of a voice of the customer in a lot of ways. So what's out there in the marketplace? What do small businesses really need help with? Sometimes it's presence management, which is something we've been developing in-house and we offer as a product. Sometimes it's search engine marketing. We offer that as a product through YP Search. So we try to really encompass the whole landscape, right? We want to be the solution provider. So as a small business, you may need direct mail. You may need print for that matter. Some small businesses do really want print as a medium for getting consumers. You may want search engine marketing. You may want presence. We want to bundle all those tools together so that as a small business, you have one voice, one expert. You can go to that expert and we can then help you, right? Solve for all these problems. So describe your sort of data architecture and how that's evolved. I presume you had a traditional, maybe still have a traditional EDW. Are you using Hadoop? What are you doing? Are you doing ETL offloads from that EDW? Are you replacing that? Are you using Vertica? Where does that fit in? Just paint a picture for us. Absolutely. So when I arrived, there was a lot of SQL Server in-house. A lot of events were getting dropped on the floor. SLAs were a mess, right?
So the common mindset at the time, this was 2009, 2010, was, hey, there's this Hadoop thing out there. Let's try that, okay? So we stepped in, my company stepped in, and we built a Hadoop infrastructure to process the data. And got that up and running, got it reliable. And then started looking at the warehousing platform. And that's where Vertica comes in. So right now what we do is we do a lot of data collection. We process everything in our Hadoop environment. We store it there, expose it through tools like Hive. And then push that data into Vertica, ultimately for reporting platforms or just for analysts to go there and write their own queries and get their own data. And so that kind of lends itself to a big challenge. How do we expose the data that's now in Hadoop through something that analysts can use? So prior to Hadoop, when you had conversations with practitioners about the data warehouse, they kind of went something like this: we're constantly evolving our data warehouse. We can't keep up. I always say it's a snake swallowing a basketball. Every time Intel comes out with a new microprocessor, we try to make our stuff go faster and throw hardware at the problem. It's just this never-ending battle. I presume you know that story well. And Hadoop comes in and you say, okay, now we have this inexpensive filtering system. Oh, I wouldn't call it that. Okay, so that's what I want to get to. I wouldn't call it inexpensive. Well, at first it sort of looked alluring. Right, yes. And then what happens is you kind of get sucked into the vortex of open source. Absolutely. All of a sudden you've got this other tiger by the tail. Open source does not mean open this, right? So describe how that affected sort of the economics of data and sort of the processes, and what does that mean for your strategy going forward? Well, you know, it became really apparent immediately. First and foremost, you had to build a very, very strong technical team, right?
From platform engineers to ops people to DevOps. And you had to have the whole gamut of people. So you're really trading what at the time would have been a cost in software and licensing, with something that's proprietary, for a cost in people, which is fine if you can get them, right? So that was one of the challenges. You have to walk into that knowing I need a core solid team to be able to use this infrastructure. Now there are so many tools out there, you have Spark, you have Impala, you have SQL on Hadoop, you have MapReduce, and these machines aren't cheap anymore, right? They're all requiring a lot of resources to run. Everybody in the company wants to use their tool of choice, which means you have to support it, which means you've kind of moved into a space which is a little bit more of an investment now. You're buying machines that are $15,000, not $8,000. And so when you do that across a large bed of data and machines, then it becomes a fiscal conversation at that point. And you're buying support subscriptions, or you're buying management frameworks, or you're going up the stack. There's a lot of tools out there you need to help manage it. Absolutely. So what did you make of, I don't know if you saw the keynotes this week. Yes. Love Michael Stonebraker. I said, okay, I was going to ask him. He was slinging it, wasn't he? So Stonebraker, we love him too, he came on theCUBE, a couple of times he's been on theCUBE. Yeah, he laid it out. He puts it out there, but one of the things he said is that all this big data stuff, it's all a bunch of BS, you know, Hadoop this, Hadoop that, it's all about the data warehouse. Now, as a practitioner, you've added some value with Hadoop. So what'd you make of that? And I'll just parse through that.
There is value there, in that what was once not feasible for a lot of companies, not the Googles of the world, obviously, but the smaller businesses like ourselves that have a big data problem in some shape or form, what wasn't feasible at the time is feasible through using Hadoop. But it poses a challenge now. You don't just need data, you need information and insights. And that is really, absolutely. I mean, you can store everything in Hadoop, it's a great pile, you can store it, all that ingest, great. And then when you really want to act on it, you've got to engineer the hell out of it, right? You take us through that. And I think that that's what we're seeing with the tools that are evolving, right? So, and then to his point, like every year a new tool comes out, right? And people are adopting it because it does add some value, and then another tool comes out, and then another tool comes out. So where is the real value? And it's kind of hard to sift through all the noise to find what is really valuable in terms of tools that you then want to put out there. There's so many, I could spend my entire team's time evaluating tools or flavors of the day. And there's also the Lego block opportunity, where, why build when you could buy, however the expression goes? So, how many streaming engines are out there right now? I mean, why build one? So Kafka, all these things people are talking about. So I've got to ask the question, just to tee up a good question. There's a ton of noise out there. So one of the things we get all the time from our CUBE audiences is, hey, just bottom line me. Is Hadoop dead? And is Spark the answer? There's a lot of Hadoop versus Spark FUD going on. Spark's hyped up right now. You know, we know why we see people using that. Okay, in memory, I get that. That's cool. Hadoop is great. I mean, doesn't seem to be going away. We were saying they're not mutually exclusive. But what's your take?
I mean, Hadoop has been useful for you guys. Absolutely been useful. So my take is always pick the right tool for the job. And sometimes that's going to lead you towards something like Hadoop. Sometimes Spark's the tool for the job. Now, my environment is a production environment. I don't have the true option of trying to beta test something in my production environment. Hadoop, I wouldn't necessarily call it dead, but it's definitely mature. And sometimes that's a good thing. Sometimes it can be a bad thing. We don't really want to rely on a lot of MapReduce. Is it relevant or irrelevant? I think it's relevant depending upon your use case. We are probably going to end up migrating off of using MapReduce jobs sometime in the near future. I love using Vertica. I think it's a very powerful database platform. I like relying on it for things. We've just completed beta testing of Vertica SQL on Hadoop. We're still going through that process, but it shows very, very, very good performance. And it's very promising as a tool that'll allow me to get past that one hurdle, which is how do people get data out of that? It's SQL on Hadoop. It's not killing Hadoop. It's just making Hadoop more accessible. I mean, SQL is SQL, right? I mean, people know SQL. Absolutely, but then you start thinking that all I'm using Hadoop for is HDFS at that point. Yeah. That's what Stonebraker was saying. Exactly, and he's right, right? So, and Spark may be a good option for some folks depending upon what they want to do. I'm going to continue to put my resources and effort around SQL on Hadoop, the Vertica SQL on Hadoop, to see what it can do for us. Because I think it's going to be very hard. You don't have the luxury. You have production. You have large scale. I mean, you don't have time to screw around. You've got to get the job done. Yes, absolutely. I mean, you have SLAs, you've got data, and it's active data, so interesting. Yeah.
Okay, so what's the biggest surprise in the industry? Let's take a step back. A lot of noise out there. Where's the signal? And what's surprising you? What's not surprising you? I mean, what's going on in today's world? From a practitioner standpoint, what's real? In terms of what's real? In terms of what's meat on the bone? What surprises me is that there's not a lot. I mean, it seems like there's a lot of BI and reporting vendors out there, but to me, it doesn't seem like there's a lot. So I have three different reporting solutions in my environment, which seems kind of odd. So we use Tableau for discovery and visualization. We're using Information Builders for enterprise reporting. We're using Vertica SQL on Hadoop for some things. We have R over here. We have SAS over there. It just doesn't seem like there's a lot of innovation in that area sometimes to take advantage of stuff. So that seems a little surprising to me. There's a little bit of innovation, but not as much as there is in other areas like streaming and SQL process and SQL streaming and Kafka and messaging. And those things seem to be where people are focusing a lot of their time and attention. Do you think it's just an evolution? It's the progression of the market right now, because there's a ton of, I mean, here at this conference, there's a lot of engineering going on. When I say engineer, I don't mean development. I mean, engineering is development. You can say it's not a developer conference. It's not like, hey, we're banging out some new code. It's not a software thing. It's really the bigger software and engineering. So is this just a case of, you know, market evolution? Yeah, I really think it is. I mean, we are only 20, 30 years from what people would have considered to be traditional warehousing practices and principles, you know, Inmon, Kimball, blah, blah, blah. I mean, and so we're not far removed from that.
Some people can't really get their head around the fact that warehouses look different today. What is a data warehouse? Is it a single software platform? Is it the whole kit and caboodle? I tend to look at it as the whole kit and caboodle. Because you should be able to get data insights anywhere along the chain. So it's interesting, you know, Dave and I were talking on theCUBE a couple of events ago. It's like, a lot of guys who are selling the products, the suppliers to people in the field, customers, they don't have a data problem. They provide the scaffolding and the, you know, the apparatus of software, but they're actually not living the data problem. So I've got to ask you the question. It's kind of out there, but as people get data full, I mean, you're an example of, you're already living data. The Genome Center in New York, they've got data coming out their ears, right? So this is an example of data full. They have data problems today. Large scale guys like Yahoo, you've been there, you know, all those web guys from Gen One, they had data problems, they built their own solutions. But now as the enterprises start to become data full, what advice do you give those guys? Because it's new territory for them. I mean, you've been living it. You've been in those whitewater rapids for, like, multiple years. I mean, so you know that. So what's your advice to those guys? What should they expect? You know, it's also some of the challenges that most companies have. So, you know, you have challenges in the expertise it takes to be able to handle all the Vs of big data, right? But then you have these challenges now on the other end of the stick. The product managers, the analysts who need to think in terms of data. And sometimes that can be a little bit of a challenge. And giving them the tools they need to be able to do that is a little bit of a challenge. So I would say, be mindful of the fact it's an enterprise change.
It's not a big data platform team that's going to come in and make things happen. A new BI tool that does magic. Yeah, the BI tool is not going to make it happen. It's a complete enterprise and cultural change, right? The whole enterprise has to think data and be data driven. If even a part of it doesn't do that, the thing kind of collapses. And it doesn't work out as well. Yeah. Great, thanks so much, Bill, for coming on theCUBE. Thanks for sharing your insights. Obviously you're a veteran in the business, and at large scale. I mean, I guess you were doing DevOps before DevOps was called DevOps, you know? I love DevOps. I love DevOps. We do too. My favorite thing in the world, honestly. Yeah, now we can definitely have you on as a guest host on our next segment. So again, we think DevOps is powering analytics. Absolutely. No brainer. And again, data full, this is the new concept. So, this is theCUBE, bringing you more data live here in Boston at HP's big data event, a special presentation of theCUBE. We'll be right back after this short break.