Live from Boston, Massachusetts, extracting the signal from the noise, it's theCUBE, covering HP Big Data Conference 2015, brought to you by HP Software. Now, your hosts, John Furrier and Dave Vellante.

Okay, welcome back everyone. We are live in Boston, Massachusetts for HP Big Data 2015. This is theCUBE, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier. My co-host is Dave Vellante. Our next guest is David Abercrombie, principal data analytics engineer at Tapjoy, one of the fastest-growing mobile monetization and analytics platforms for developers, and a really big part of the mobile revolution that, in my mind, started eight years ago with the iPhone launch. And now it's exploding. It's relevant. It's the center of the conversation. And of course, under the hood is analytics and cloud and infrastructure. Welcome to theCUBE.

Thank you. Pleasure to be here.

So give us a taste of what the mobile revolution is. Before we get into some of the under-the-hood stuff, just your personal take on riding it with Tapjoy. The growth has been fantastic. Share some of the numbers you have, some of the data. And then, what's it been like to be on that surfboard, if you will, of this mobile tsunami of awesomeness?

Yeah, it's been quite a ride. Tapjoy, we do mobile advertising. We have over 500 million active users each month, so we have a lot of users engaging with our ads. We do mobile ads, but it's not just advertising a banner ad that can annoy users.

Or boring retargeting crap.

Exactly. People engage with our ads to get rewards, typically virtual currency in the game they're playing. So we have the kind of ads that people actually seek out and engage with at times. Tapjoy has basically three kinds of customers. We have the app publishers who use our SDK to help monetize their apps. We have the advertisers who need to get their message out. But then also the users, who we have to keep happy. We have to show them relevant content, relevant ads, and make sure that the process is smooth.

Yeah, and the growth has been phenomenal. I mean, you simply look around and everyone's got their mobile device. They're just ubiquitous. And you're in San Francisco, the center of the universe when it comes to, well, I'm only kidding, being from the area. But you live there. No, but I mean, there's a cultural shift right now in San Francisco.

Well, yeah, particularly with the data. The data community is so vibrant in the San Francisco area. The meetups, the user groups, just walking down the streets, people talking about data, it's in the air. I came into the big data world in large part for that vibrancy, the new things. I had come from the older data technologies, and their conferences were stodgy and old. But yeah, it's all happening, and it's happening in San Francisco. And the amount of sharing and collaboration is really tremendous.

And it's super relevant too. I mean, you can't get better than that. So you bring a great perspective then, because you grew up in the world of traditional DBMS and now you're seeing this new shift. I was struck by Stonebraker's comments this morning, essentially saying it's just a sort of new data warehouse, just a big junk drawer. Do you see it that way?

Well, no, that's not the way I see it. Of course, people do have piles of junky data lying around, and in fact, I think that's one of our biggest challenges in the big data world.
It's relatively easy to pump large volumes of data around, store large volumes of data, and run queries against large volumes of data. But when it comes to what that data means, where the data comes from, whether the data is conformed, can you actually join it together? At Tapjoy, we basically have two separate uses for our big data, which have quite different needs in this regard.

Of course, balancing the needs of publishers, advertisers, and users requires very sophisticated machine learning and data science algorithms digging into the data, so that we know which personas are active in a particular app and which ads are resonating with the users. And this is all done through machine learning, cluster analysis, regression analysis. Tapjoy wouldn't be alive as a company without that sort of data mining, if you will.

But then also, we have traditional old-school BI, you know, old-school star schema with standard grid reports of aggregated data. And it's critically important that data have good accuracy so the users have confidence. That data would be ignored if our users didn't have confidence in it, and the confidence comes from accuracy. And if you just have a big bunch of data from disparate sources in a data lake, it's very difficult to achieve that confidence, that crystal clarity that the modern BI tools give you when you slice and dice. You know, it's called business intelligence, but without accuracy it's really a bug amplifier.

So is that a never-ending trade-off, David, between that precision, that data quality, and that data flexibility? Or are those two worlds ever going to come together?

Well, I don't think they'll come together. I mean, on the one hand you've got apples and over there you've got oranges, and it's not like our apples are transforming into oranges. There's a use for both. You know, the machine learning stuff can deal with much more slop in the data. But when you're doing old-school BI, you need confidence, you need that accuracy.

One very interesting place where I think that conflict is really apparent: you know, Kafka and queuing systems and Spark are a big topic of what people are talking about now. How to pump large volumes of data into Vertica is a very hot topic. That's relatively straightforward to do; it's a matter of getting the technology to work. But when it comes to highly accurate BI-type reporting, you need the metadata. So much of these big data streams are based on IDs, you know, the app ID, the user ID, the advertising ID. That's what's coming in through the big data stream. But to make sense of that, you need to know which app belongs to which publisher, you need the metadata. And the metadata transfer is much more difficult, I think.

It's an engineering challenge.

It's an engineering challenge, it's a logical challenge, it's a challenge of semantics and a challenge of, you know, instrumentation. It's more logical. And your typical data engineer, your typical ETL engineer, isn't really steeped in the subtle meanings of data. For instance, country. Country is an important attribute for almost any sort of clickstream use like this. Well, an ETL engineer may see a column called country in some data source. Okay, there it is, country, job is done, they've brought country into the system. Well, country is subtle. Let's say you're an American traveling in Canada. Are you in Canada, or are you an American? The concept of country is difficult.
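To make that country subtlety concrete, here is a minimal sketch, with entirely hypothetical table and column names, of how the ambiguity shows up the moment you try to query it: one ad event can carry both the geo-IP country of the impression and the account country of the user, and a report has to decide which of the two "country" means.

```sql
-- Hypothetical illustration of the "country" subtlety described above.
-- A click event may carry the country derived from the device's geo-IP at
-- impression time AND the country on the user's store account; an ETL job
-- that grabs whichever column happens to be named "country" conflates the two.
-- All table and column names here are made up for the example.
SELECT
    e.event_id,
    e.geoip_country     AS country_of_impression,   -- where the device was
    u.account_country   AS country_of_user,         -- where the user "belongs"
    CASE
        WHEN e.geoip_country = u.account_country THEN 'domestic'
        ELSE 'traveling_or_mismatched'
    END                 AS country_agreement
FROM click_events e
JOIN users u
  ON u.user_id = e.user_id
WHERE e.event_date = CURRENT_DATE - 1;
```

The point is not this particular query; it is that "country" only has one meaning after someone decides which column each report should use, and that decision is the semantic, metadata work he describes, not the data plumbing.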
Well, in the news, Snapchat just announced that they're turning off autoplay because a lot of complaints have come in, like my daughter's when she's traveling outside the country, about the bandwidth charges. So there's a system of intelligence that's emerging. What is happening there? Because I want to get your take on this. You mentioned it earlier, boring banner ads, and I said stupid retargeting. Maybe you guys do retargeting, but the banner ads of old are dead or dying a slow death. It's engagement. It's intelligence, contextual relevance at the right time to the user, in context with what they're doing. And a seamless experience, you know, so that the look and feel of the ad is consistent with the look and feel of the app. How do you do that? How do you build a system of intelligence with the kind of latency and performance to give the right thing to the right user at the right time, at the right place?

Well, that's a very difficult thing. Of course, that's the $64 question. Now we're talking about the world of real-time ad optimization, for which we don't use a big data system like Vertica. We used to do that through a system of pre-computing certain data elements and serving them up through HBase, a Hadoop product. But we're moving more and more into the world of SQL, real-time in-memory SQL databases like MemSQL, where we can express our business rules and our algorithms in SQL. And as a SQL person, SQL gives me comfort. I like SQL, and all these NoSQL systems...

It's the language you're comfortable with. It's the killer app of big data.

Exactly, it's expressive, it's flexible, it's transparent. And so we're shifting real-time ad optimization over to SQL logic.

So a lot of stuff's going on. There are all kinds of ways to prep the data, but ultimately having it available in real time is the key. Or not?

Well, yes, and actually minimizing the prep. When we would do our pre-compute and serve it up through HBase, it required programming and aggregation in advance. But now, using SQL to do this, we can bring the data in raw and do our aggregation in real time over large volumes of data.

So one of the big use cases, of course, in so-called big data has been this ETL offload of the enterprise data warehouse. I presume you've done some of that, a lot of that. What are you doing there? Where does Vertica actually fit? Are you offloading Vertica, or are you offloading the traditional data warehouse? How does that all work?

Well, Vertica is our traditional data warehouse.

Okay, there you go. So where does it fit in this whole Hadoop movement?

Well, of course, we've been around for many years, so in our legacy ETL data flow, the data was first queued into Hadoop, and Hadoop is where our machine learning, the heart of our data systems, runs. And then from there we brought it into Vertica, into the operational data store, for easier ad hoc analysis using plain old SQL. Every table we have in Vertica, we also have in Hadoop. Of course, it's one of our little conflicts within the company, do we want to do this in Hadoop or in Vertica? But Vertica usually wins out because of the expressiveness of SQL and the ease of use. And then from that, we populate our star schema straight within the Vertica database with pure SQL ETL.

And there are performance implications as well, presumably.

Oh yeah.

Yeah, okay. We know you've got a hard stop. Really appreciate you coming on theCUBE.
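For readers who want a picture of that last step, here is a minimal sketch of a pure-SQL ETL pass that rolls raw ad events up into a daily star-schema fact table, the kind of aggregation he describes expressing in SQL rather than pre-computing and serving through HBase. The table and column names are hypothetical, not Tapjoy's actual schema, and the statement is generic warehouse SQL rather than anything vendor-specific.

```sql
-- A minimal sketch of a pure-SQL ETL step: raw ad events already loaded into
-- the warehouse are aggregated into a daily fact table keyed to conformed
-- dimensions (date, app, advertiser, country). Names are hypothetical.
INSERT INTO fact_ad_engagement_daily
    (date_key, app_key, advertiser_key, country_key,
     impressions, engagements, conversions, revenue_usd)
SELECT
    d.date_key,
    a.app_key,
    adv.advertiser_key,
    c.country_key,
    COUNT(*)                                      AS impressions,
    SUM(CASE WHEN e.engaged   THEN 1 ELSE 0 END)  AS engagements,
    SUM(CASE WHEN e.converted THEN 1 ELSE 0 END)  AS conversions,
    SUM(e.revenue_usd)                            AS revenue_usd
FROM raw_ad_events e
JOIN dim_date       d   ON d.calendar_date   = e.event_date
JOIN dim_app        a   ON a.app_id          = e.app_id          -- metadata: which app belongs to which publisher
JOIN dim_advertiser adv ON adv.advertiser_id = e.advertiser_id
JOIN dim_country    c   ON c.country_code    = e.geoip_country
WHERE e.event_date = CURRENT_DATE - 1
GROUP BY d.date_key, a.app_key, adv.advertiser_key, c.country_key;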
I know you've got to roll, but I want to ask one final question. Share with the audience, in your own words, okay, take your Tapjoy hat off. You know, data geek, engineer, you've been around the block. What's going on? What is actually the most important thing happening in this big data world, the data science world, the application development world? We're all seeing Agile out there, apps running the show over infrastructure. What's the big intoxicating thing that's attracting a lot of the heavy hitters to this world?

Well, of course, the ability to expand your horizons and do things with data you couldn't do before. The trend that really warms my heart is more and more SQL, you know, more and more SQL on Hadoop. I love SQL, SQL's expressive. The relational model's been around since 1970, and I think we're finally getting it right now, at the data volumes that count.

So SQL's the new abstraction layer, reborn. The rebirth of SQL, it's back, never going away. Thanks so much for sharing. This is theCUBE. We're live in Boston, Massachusetts, extracting the signal from the noise and sharing the data with you here at the big data event. We'll be right back after this short break.