Live from Boston, Massachusetts, extracting the signal from the noise, it's theCUBE, covering HP Big Data Conference 2015. Brought to you by HP Software. Now, your host, Dave Vellante.

Welcome back to Boston, Massachusetts, everybody. I'm Dave Vellante of Wikibon and wikibon.com. I'm here with Kevin Goode, who's the director of platform engineering at Inmar, and we're here at the HP Big Data Conference 2015. Kevin, welcome to theCUBE.

Thank you, Dave.

So, we were talking off camera, this is your first year at this event.

Yes.

What got you here?

Well, the folks from Voltage asked me to come up and present and talk about how we use Voltage in our Hadoop cluster to do some of the healthcare data analysis that we want to do.

Yeah, so that's going to be a good conversation. I want to dig into that. We heard a lot of conversations yesterday about Hadoop and all the big data hype and how Hadoop is not big data, so let's dig into some of that.

Okay.

As a practitioner of Hadoop, you were probably taken aback a little bit, but I'll come back to that. I want to talk more about Inmar. Tell us about Inmar as an organization that's been around for three or four decades. What's Inmar all about and what's your role there?

Inmar started 35 years ago as a coupon processor, a coupon redemption company, acting as the trusted middleman between retailers and manufacturers. And we've evolved over the years. The retailers asked us to start handling their pharmacy claims as they all started to put pharmacies inside of the grocery stores. It was another type of receivable, and they said, hey, can you do this too? So we started doing that. And then we also started doing return goods for them. The best way to describe return goods and reverse logistics is, if you go into your grocery store on Halloween, you've got a shelf full of candy. You go back the next day and that's all gone, and all the Christmas stuff is out.
Well, all the candy has to go somewhere, and it comes to a company like us. We count it and credit the store for sending it in, try to sell it off to Big Lots or the dollar store or whoever, or just dispose of it. So mostly we've been a financial transaction middleman between retailers and manufacturers across three different early lines of business.

It's a version of waterfalling in retail.

Exactly.

That's a fascinating story about a company that transformed from a legacy, old-line business to a modern business. I heard a story this morning on the radio about Columbia Records. You're probably too young to remember Columbia Records, but when I was a kid you'd get mail, all kinds of physical spam mail saying, hey, sign up for our record-of-the-month club. For pennies you'd get an album, and every month you'd get a new album for a much higher price. Columbia Records was a billion-dollar-plus company, and last year they had revenues of 17 million. They just went out of business. Inmar didn't go out of business; instead it transformed. I wonder if you could talk about that transformation, how it took place, and what role you and your team played.

Right. So our CEO decided about five years ago that we needed to transition from being just a business outsourcing company to really a data-driven enterprise. We started down our journey, and I was tasked with going out and starting to look at big data, how we could leverage big data to take all of the transactions that we generate from our business outsourcing business and start to create insight. So we started down the Hadoop path really about three years ago. We built a cluster, picked some vendors, and we've started moving data over, hiring data scientists and data engineers, and starting to drive insight.
And we actually have products in production coming out of our Hadoop cluster, which I've found a lot of people never really get to; they talk about POCs but never get to, hey, we're making money doing it.

One of the things Ken Rudin from Facebook said today is that big data is not just about insights; it's about being able to take action, and actually taking that action and effecting business change. Five years ago, in 2010, most people didn't even know what Hadoop was. And so we heard yesterday it's all a bunch of hype. That was Mike Stonebraker being Mike Stonebraker. And today, big data is not Hadoop. And while I would agree with some of those things, take us back to 2010. Hadoop changed the way in which you looked at data and were able to store and process data, and the economics of data, didn't it?

It absolutely did, yeah. So when I was looking at everything that we could go with for analytics, we looked at Netezza, we looked at Greenplum, we looked at SQL Server Parallel Data Warehouse, and all of those were in the multi-million-dollar range just to get started and see if it even worked for us. I built my cluster for around $50,000, and with another $50,000 invested in services from Hortonworks, we had something up and running and started to actually generate actionable items out of it.

And Vertica was or was not part of that equation?

We didn't look at Vertica at the time.

So it was just pure Hadoop, white box stuff, shipping function to data, which is the epitome of Hadoop. In the old days, you used to shove a bunch of data into a Unix box, buy some Oracle licenses, and if you had any money left over, you might get to a project.

Yeah, and we decided I wanted to spend my money on engineers instead of on licenses.

So Hadoop in that sense was a profound catalyst for business change, was it not?

Yes, it absolutely was.
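The "shipping function to data" idea Dave mentions is worth a quick illustration. The sketch below is not Hadoop code; it's a minimal, self-contained Python analogy in which each list stands in for a data block stored on a different cluster node, the map step runs "next to" each block, and only the small partial results move:

```python
# Minimal sketch of the MapReduce "ship the function to the data" idea:
# the mapper runs node-local against each partition, and only the small
# intermediate counts cross the network to be merged.
from collections import Counter
from functools import reduce

# Pretend each list is a data block living on a different cluster node.
partitions = [
    ["coupon", "claim", "coupon"],
    ["claim", "return", "coupon"],
]

def map_phase(block):
    # Runs where the block lives; produces a small partial result.
    return Counter(block)

def reduce_phase(a, b):
    # Merges the partial counts; Counter addition sums per-key counts.
    return a + b

partials = [map_phase(b) for b in partitions]   # compute moves to the data
totals = reduce(reduce_phase, partials)         # only partials move
print(dict(totals))  # -> {'coupon': 3, 'claim': 2, 'return': 1}
```

The economic point from the conversation is that this pattern runs on cheap commodity "white box" nodes, whereas the MPP appliances Kevin priced required a multi-million-dollar commitment up front.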
And so then, of course, I think it's fair what some of the speakers are saying at this event, it's not just all about Hadoop. You probably found, okay, there's more that we can do. Maybe take us through that journey.

Right, so we're finding Hadoop is good at doing a bunch of crunching and doing the deep analysis, but it's not good for ad hoc, it's not good for folks who just want to go in and get an answer really quick. So we're finding that we will do this kind of big batch processing and push out to other systems, like a Postgres, or we use Couchbase, and start to build APIs that we can then deliver as data products to the rest of our enterprise.

So is that where Vertica came in as well, trying to filter some of the data and put it into Vertica, or did Vertica come in more recently?

So Vertica is something that we're considering, but really what brought me into the HP fold was using Voltage.

Oh, talk more about that.

So a big part of what we do is this pharmacy claims reconciliation, and that's all HIPAA data. And we know that we want to dig into the HIPAA data, because that's a huge, huge market. There's a lot of money out there on the table, especially with the Affordable Care Act and people needing to move away from fee-for-service to more of a performance-based metric system. But I can't put protected data in Hadoop and have my data scientists go through and start figuring out algorithms. So I need to use Voltage to do reversible encryption, so I can put de-identified data into Hadoop. My data scientists can do all of the stuff that they do, come up with their algorithms, come up with the result set at the end, then I can move that back over into a protected system, re-identify it, and then we can act on it.

So you're talking about, you know, you're considering Vertica. Where would an MPP database like Vertica fit in your architecture?

It would probably fit more in the reporting side.
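The de-identify, analyze, re-identify loop Kevin describes can be sketched in a few lines. This is a toy token-vault stand-in, not the actual Voltage SecureData API (which uses format-preserving encryption and managed keys); the field names and the demo key are hypothetical, and the point is only the shape of the pipeline:

```python
# Toy stand-in for a Voltage-style reversible de-identification workflow.
# Real deployments use format-preserving encryption with managed keys;
# here a keyed token plus a vault makes the round trip visible.
import hmac
import hashlib

SECRET = b"hypothetical-demo-key"  # stand-in for a managed encryption key

class Deidentifier:
    def __init__(self):
        self._vault = {}  # token -> original value (the "protected system")

    def deidentify(self, value: str) -> str:
        token = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]
        self._vault[token] = value
        return token

    def reidentify(self, token: str) -> str:
        return self._vault[token]

# Pipeline: protected claims -> de-identified data for the cluster ->
# analysis results -> re-identified in the protected system.
claims = [{"member_id": "A123", "drug": "statin"},
          {"member_id": "B456", "drug": "statin"}]
deid = Deidentifier()
safe = [{**c, "member_id": deid.deidentify(c["member_id"])} for c in claims]
# ...data scientists work only on `safe` inside Hadoop...
hits = [r["member_id"] for r in safe if r["drug"] == "statin"]
actionable = [deid.reidentify(t) for t in hits]  # back in the protected system
print(actionable)  # -> ['A123', 'B456']
```

The design point is that the cluster and the analysts never see protected identifiers; only the results set crosses back into the protected system for re-identification.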
So, you know, some of our clients... Walmart is one of our clients, and whenever Walmart is your client, your scale goes up by a lot. And I've considered using Vertica to load some of our healthcare data where they want to do reporting, and they want to have 40 different slicers on the data. That's just outgrown what we can do with SQL Server, and I need to find something that performs better.

So your main EDW is running on SQL Server today? Is that right?

Right.

Okay, so, I mean, what a lot of people are doing is sort of ETL offloads to an MPP database, and that's sort of what you're considering.

Yep.

Where do you see the analytics piece? Where is that in the whole data pipeline? Is that sort of embedded into the database? Have you built custom applications?

We have traditionally built custom applications across the board, because we generally set everything up as multi-tenant. So, in our healthcare space, we have 90 clients, and we really don't want to have 90 separate databases and have people come in and connect directly to the database. So it's usually web apps that our clients can log into, and then we handle security at that point and give them a consistent look and feel. But inside the web apps, it's very analytics driven, and you can have like 40 different filters and go find data however you want to find it.

So, Colin Mahony, I don't know if you saw his talk yesterday, but he had a chart up there. He showed the ERP days, the original ERP days, highly customized, and then moving into packaged apps, you know, SAP and Oracle Financials, et cetera, et cetera. And then I thought he was going to take us down a similar path, saying, you know, big data's going to go the same way. It's highly customized today, just like you're doing. And I thought he was going to say it's got to move toward packaged apps. But what he said is, he used this notion of composable apps: flexible pieces that you can essentially put together to build what is a quasi-customized app.
Is that where you see it going?

Yeah, that's really what we're working on with the dashboards that we're building. We're trying to make sure that everything is kind of widgetized, if you will, so that we have a toolkit where a power user from any client can come in and start pulling widgets onto their dashboard and get whatever kind of view they want into the data. We're really trying to treat what we're doing as more of a SaaS model. So, you know, we want to provide the dashboard, let you customize it, and we can pull in whatever components you want to.

So, obviously, big user of open source.

Yes.

A lot of it you're not paying for, which is great, helps with the economics.

Yep.

But then there's the support issue. So you're using the Hortonworks distro. Are you a subscriber?

I am a subscriber, yes. And the reason we went with Hortonworks was because it is just a support subscription. It doesn't have closed-source pieces. And I remind my sales guy at Hortonworks pretty regularly that I expect you to be a valuable partner, that we should have an ongoing conversation, that we should be meeting probably quarterly and working as partners, not just you showing up once a year going, hey, can you cut me a check?

Well, it's interesting to watch that whole distro movement. It was Cloudera all alone, and then all of a sudden Hortonworks comes in, and then everybody had a distro. I think SiliconANGLE announced a Hadoop distro at one point for kicks. And you see the big three, Cloudera, Hortonworks, and MapR, all of them proponents of open source, all of them contributors to open source, but the definition of open has shifted and changed in the last 15 years, hasn't it?

Yeah, it certainly has.
When we looked at Cloudera, I noticed that their manager, which they kind of touted as being their big thing, had closed-source bits in it, and MapR has closed-source drivers for their file system. That's really what drove me towards Hortonworks.

So you're a proponent of, the more open the better. Okay. But the flip side of that is, there's function and integration, in theory, that you can get from a closed-source approach. Do you not buy that? Do you feel like open source has caught up, or will eventually catch up? What's your thinking on that?

I think there is some value there, but I really need to see the value. And again, I want to make sure that whatever vendor I choose, if they are going to maintain some closed source, is a partner and is going to work closely with me and provide more value than just, well, here's some code that's closed source that you can't use unless you pay them.

Yeah, so obviously you work with a lot of proprietary software vendors, Microsoft as an example, but they're proven. So how do you make the decision, open versus closed? How do you make the decision in general for big data analytics in terms of what to buy, what to build, you know?

My first inclination is usually to look at the open source stack and see if there is a reason that it can't work, a reason that it's not able to fulfill what we want to do. And from there I'll start branching out, and again, I don't mind paying a vendor as long as the vendor is actively engaged. I don't want to pay somebody a lot of money every year just to talk to them once a year.

Mm-hmm. Now, we've heard a lot of talk about Spark at this event, Kafka, big open source initiatives that are getting a lot of press and a lot of attention. What are you doing there?

We're looking at Spark and Storm and Kafka right now for our digital platform, because we know that paper coupons eventually will start to fade away, and digital is the future in that space.
We want to do real-time streaming, so we are working on turning that front-end application that handles the coupon redemption into more of a pub/sub, message-based system, and then being able to stream that data down. We've looked at Kinesis through AWS, because we are pretty heavy in AWS for our digital space, and we're looking at Storm, since it is part of the Hortonworks stack, and Spark, trying to figure out which one of those makes the most sense. And the kind of crazy thing is, if I talk to Cloudera, then Spark is the thing and Storm is terrible. If I talk to Hortonworks, Storm is the way to go. So, yeah, we've spent a lot of time weeding through the marketing material and figuring out what really makes the most sense.

It's interesting. So, when you mention Amazon, let me start there. If you look at Amazon, Google, and Microsoft Azure, it appears that they're building out a data management layer that's integrated, and they're delivering it as a service. Amazon turned the data center into an API, essentially. But it's arguably less functional from a data analytics standpoint than what you could get by cobbling together open source pieces and putting in your own customizations. But the trend seems to be going toward a more integrated approach, even with projects like Tungsten. What's your thought on that? That sort of simple, integrated, cloud approach, maybe a little less functional, versus the more bespoke approach, maybe a little more complicated, but more functional?

I'm open to either one. I've actually recently looked at Azure and at AWS for Hadoop. My CTO kept saying, how come we're not doing Hadoop in the cloud? EMR is there, you mentioned Kinesis. So I ran the numbers, and it was costing me about $2,000 a month on-prem. To do the equivalent cluster out in Azure was about $11,000 a month, and AWS was about $9,000 a month.
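As a rough sanity check on those figures, here is the arithmetic behind the comparison. This is a sketch using only the monthly run rates quoted in the conversation, assuming they stay flat over a typical three-year hardware lifetime (no growth, reserved-instance discounts, or refresh costs):

```python
# Back-of-the-envelope version of the on-prem vs. cloud comparison.
# Monthly figures are the ones quoted in the conversation.
ON_PREM, AZURE, AWS = 2_000, 11_000, 9_000  # $/month for an equivalent cluster

def three_year_delta(cloud_monthly, on_prem_monthly=ON_PREM, months=36):
    """Extra spend of a cloud option over on-prem across the period."""
    return (cloud_monthly - on_prem_monthly) * months

print(three_year_delta(AWS))    # -> 252000
print(three_year_delta(AZURE))  # -> 324000
```

On these assumptions, the cheaper cloud option still costs roughly a quarter of a million dollars more over three years, which is the "difference" Kevin says he'd rather spend on engineers.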
And again, when you get right down to it, I would prefer to spend that difference in money on good engineers, so that we can build good stuff.

So was the cost a function of a lot of different things? Was it mostly bandwidth, because you're moving data around, or storage?

It really wasn't bandwidth at all, because a lot of what we're doing is pulling the data from AWS down to our Hadoop cluster to do the analytics on the digital promotion side.

So it was compute and storage.

It was all compute and storage. And the pizza boxes that I bought three years ago cost just about nothing now.

Wow, so that's a stark difference. Now, you're doing full total cost of ownership. You're including your people and the time?

Yeah, rack space, data center space. I mean, I know what we pay per rack unit inside of our rack, inside of our data center. So I added all of that in, and I still came out a lot cheaper on-prem.

But you do use AWS for certain use cases. What are those?

That's for our digital promotions network. Where we've really decided to go cloud is when we have an app that needs to be up 24/7. We don't have multiple data centers; we have a single data center. So for the stuff where we're interacting with a point-of-sale system at a retailer, we want that up 24/7 with high bandwidth, and we found AWS does that best.

So actually, that's interesting. So the requirement there is higher availability, application availability, and the cloud is your solution there. Interesting. Last question, just on this event, it's your first year here. Thoughts on what you've seen, what you're getting out of HP Big Data Conference?

I've really enjoyed it. And I think what I've really gotten out of it, since I've been considering Vertica, is being able to talk to some of the other HP customers that are actually running Vertica and ask, what do you find works well? What doesn't work well? What's your experience been? That's been the most valuable to me so far.

Kevin, great segment.
Thanks very much for coming on theCUBE. Really appreciate it. I always love the practitioner perspective, cut through the BS and tell it like it is. So thanks very much.

Okay, well, thanks for having me on.

You're welcome. Keep it right there, buddy. We'll be back with our next guest. This is theCUBE. We're live from Boston, HP Big Data Conference 2015. We'll be right back.