Live from Boston, Massachusetts. Extracting the signal from the noise. It's theCUBE, covering HP Big Data Conference 2015. Brought to you by HP Software. Now your hosts, John Furrier and Dave Vellante. Okay, welcome back everyone. We are live in Boston, Massachusetts for theCUBE at HP Big Data 2015. Hashtag HP Big Data 2015. Or go to crowdchat.net slash HP Big Data 2015. I'm John Furrier. With Dave Vellante, our next guest is Chris "CB" Bohn, Senior Database Engineer at Etsy. Welcome back to theCUBE. Thank you very much. So Etsy, hot in the news, beating earnings, growing. Last time we talked, you gave an awesome talk about large scale. We're back. Large scale is back and hot. It's gotten even bigger. It's gotten even larger. But great keynote speech by Stonebraker, a conversation with Colin. We thought about a couple things, variety of data. I focused in on that. Yeah, volume, velocity, that's cool. It's happening bigger, faster. But the variety piece is interesting. I want to get your take on where all this fits in. How do people manage the data analytics with the large scale build out of cloud and on-prem? You know, it's a tough problem. At Etsy, we've got a lot of different data sources and data repositories. We've got Hadoop, we've got Vertica, and we're trying to unify them all into one manageable piece, but it's hard to do that because they're all sort of proprietary and they all grew up organically on their own. And our job to make it more manageable is to bring them together. What's the engineering conversation? It's all the guys sitting around, really looking at this from multiple perspectives. How do you guys frame it? I mean, a lot of stuff's going on. You have different databases. It's all over the place. You have different platforms and technologies. I mean, unification, okay, that's a high level goal, a guiding principle. How do you frame it? How do you start engineering that? I mean, what do you tackle first?
What's the approach? You know, it's kind of like if you were building a city. You've got all these different parts of the city, and how do you connect them all? It's through the transportation layer. So what we're doing right now is we've gone deep into Kafka. We're using that as our messaging bus. We used to replicate data from our production databases into Vertica with our own proprietary process: we'd hook up directly to the MySQL databases, read the binlogs, and do an ETL process into Vertica. Well, now we're pumping all that data into Kafka, and our replication is basically being a consumer of those data messages as they come along. And you can see Vertica has done the same thing, as they just announced today that they've got this Kafka streaming capability now. Excavator, they call it. Yeah, well, that's their next version of Vertica, and part of that is this Kafka streaming thing. So they're thinking along the same terms. Hey, let's get a common bus for data, and then everybody can be a consumer of that, and so that's where we're headed. It seems like this- It's like tapping into the main data stream. Yeah, exactly. Well, it seems like the data pipeline was very disparate, all these different pieces, streaming, MPP databases, et cetera. So you're saying there's this clear trend to start to bring them together. Now the cloud guys are trying to do that with integrated data management, but you don't get the functionality, right? So what's more important to you, that functionality or the simplicity? Well, they're both important, but simplicity in the end is going to, I think, win the day, because it's got to be manageable. If it's not manageable, then it doesn't matter how functional it is. As a service, ultimately, that's really where you're headed, right? We used to use, and currently still do use, Vertica as a conduit of data into Hadoop, which is kind of backwards from the way it's usually done.
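The replication pattern CB describes, with change events flowing onto one bus and the warehouse acting as just another consumer, can be sketched as a toy example. This is a minimal simulation, not Etsy's actual pipeline: a `queue.Queue` stands in for a Kafka topic and a plain dict stands in for Vertica, and all names here are hypothetical. A real deployment would use a Kafka client library and Vertica's Kafka loader.

```python
import json
import queue

# In-memory stand-in for a Kafka topic (hypothetical; a real setup uses a broker).
topic = queue.Queue()

def publish_change(table, op, row):
    """Producer side: a binlog tailer emits one message per row change."""
    topic.put(json.dumps({"table": table, "op": op, "row": row}))

def drain_into_warehouse(warehouse):
    """Consumer side: the warehouse subscribes to the bus like any other consumer."""
    while not topic.empty():
        event = json.loads(topic.get())
        if event["op"] == "insert":
            warehouse.setdefault(event["table"], []).append(event["row"])

publish_change("orders", "insert", {"id": 1, "total": 9.99})
publish_change("orders", "insert", {"id": 2, "total": 4.50})
warehouse = {}
drain_into_warehouse(warehouse)
print(len(warehouse["orders"]))  # → 2
```

The point of the design is visible even in the toy: once every producer writes to the bus, adding a new downstream system means writing one more consumer, not one more bespoke ETL process.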
Usually you have your data in Hadoop and you have the- Filter it and then send it into it. Exactly, but since we were getting our production data into Vertica, we found it easier to just connect it to Hadoop and bring the data over that way. But since all this data is now going to be flowing through the Kafka pipeline, all these different databases can now subscribe to that data, and it makes it much easier. We only have to manage the Kafka part of things now, and then everybody gets to consume it on their own. So do you have sort of traditional RDBMS that you're offloading, you know, ETL offloading into Vertica, or is Vertica your sort of main EDW? Well, basically we've got our production databases, which are horizontally sharded MySQL. We also have some Postgres databases that are sort of legacy, which mostly have a lot of tall tables for billing information. And we have our own proprietary tool, we call it Schlep, that brings the data from- Schlep, it's Yiddish, it means to lug and carry stuff around. Okay, but- Well, so we brought that data into Vertica with our own Schlep process, right? But now we're saying, okay, instead of having all these different ETL processes, let's try to just whittle it down. Let's get one unified pipeline of data, Kafka, and then we can use these robust consumer products to bring the data in. And so we're going to make use of the Vertica Kafka streaming connector. It's going to make our life a lot easier. See, we've got to ask the question. It's one of the things Dave and I always riff on. It's kind of like the unknown. And this is a database meets real time problem. So, you know, good data stores, you've got the indexes, but now all the unstructured data, whatever you want to call it, the junk drawer, whatever.
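The horizontally sharded MySQL tier mentioned above typically relies on some deterministic routing rule so every lookup for a given user hits the same shard. Etsy's actual scheme isn't described in the interview, so this is purely an illustrative sketch with hypothetical host names, using a stable hash of the key.

```python
import hashlib

# Hypothetical shard hosts; the real sharding scheme is not described here.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: int) -> str:
    """Route a user to a shard via a stable hash, so the mapping never changes
    between requests (unlike Python's built-in hash(), which is salted per run)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for(42) == shard_for(42))  # → True
```

A modulo scheme like this is the simplest possible version; production systems often layer a lookup table or consistent hashing on top so shards can be split without rehashing every key.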
To make something asynchronous in real time, you need data structures, you need lists, you need some coolness. You've got things like Redis out there, Amazon's got Beanstalk, and you've got Elasticsearch everywhere. So you've got to imagine the complexity involved. How do you get the databases to scale so fast when you've got to deal with a lot of real time streaming and then the database piece? How does that all click together? Yeah, you know, it takes a unified effort between the data analysts, the people who are running the databases, and the SysOps guys who are building the underlying hardware to make it all good. You know, we're looking at speeding up our databases by going to SSDs more. They've come down in price enough, they're getting reliable enough, and they're super fast, so that's going to make things a lot better there. The mean time between failures seems to be better than spinning rust. So are you guys looking at this challenge, or something that we're kind of just seeing some data on, which is how to take the benefit of elastic cloud and bring that into some sort of listing-based streaming with Twitter feeds? We don't make use of any external clouds at all. We just have too much data, and the problem is not that they're a bad thing, it's that it's hard to get the data into them in a timely way. Now, years ago, and years ago means like two, three years ago at most- Dog years. Yeah, exactly. So we used to spin up some Hadoop things in AWS, but it turned out to cost a lot of money. Because you're moving stuff back and forth, and the bandwidth costs. Back and forth, and the storage kills you too. Spinning it up and down. Our storage bills on Amazon just went up and up and up. We were spending a lot of money every month, and we crunched the numbers and said, well, let's just build our own Hadoop cluster. It's actually going to cost us less because it'll be co-resident with the sources of the data. And so does that cost a lot of money?
A lot of machines laying around too. It's off the shelf machines. You can throw it on great commodity gear. Yeah, we now have a huge Hadoop cluster. We've got a team dedicated to maintaining it, and that's worked out well for us. So it really depends on the nature of the company. I think for your smaller startups, going into the cloud is awesome for them. But once you get past a certain scale, you've got to look at it and say, are we better off doing this ourselves in our own facility? So Stonebraker was talking about the hype and the bullshit, and he said BS, but he meant bullshit, in a lot of the big data stuff. So where's the reality out there? Share with the folks, in your mind, what's the core reality right now in terms of the problem space, the solutions out there, and the bar that people just want to jump over? I mean, it's not just change the world. It's some basic stuff, right? What's the reality? Well, Stonebraker brought out the heavy weaponry. Yeah. And he let it fly. It was so good, wasn't it? Yeah, it was good. But he had a few misfires in there. He's saying, oh, it's all hype, and the big thing I see for everybody out there is you bought into all this hype. Well, okay. Anytime you have some product out there that's actually doing something useful, there's going to be some marketing hype around it. That's just the world we live in, okay? What did he misfire on? Overplaying his hand on the hype, or was he technically wrong? Was he pumping? Well, he implied that there wasn't really hardcore value being created. For a practitioner, that must have been kind of off-putting. He kind of shot down everything. He basically said, Hadoop? Why are you even using it? The people who invented it at Google abandoned it like 10 years ago or whatever. The truth is- He didn't misfire on that. I didn't want to call him out on that, but case in point, Kubernetes is open source because of what happened with MapReduce, which got reimplemented as Hadoop and commercialized by Cloudera. Right.
And that internal conversation at Google was very much like, we don't want to have the same thing that happened to MapReduce happen to Kubernetes. So that kind of refutes Stonebraker's comments right there. Well, I think so. And the good professor makes a lot of good points. We're going to get him back on theCUBE. See, Stonebraker, you've got to get back on theCUBE. Come on. So I think the problem here is that there are two forks, really, of data analytics. One is very batch oriented. That comes out of the whole Hadoop thing, MapReduce. But the truth is that a lot of analysts are more ad hoc in nature, and a lot of them aren't the best at writing MapReduce jobs, but they can write SQL with the best of them, right? And so that's where we have Vertica for the analysts who are ad hoc oriented, and then we've got the MapReduce jobs in our big Hadoop cluster for those batch jobs. So I was going to ask you, are you serving data scientists, or are you serving business analysts? And you're saying both. Because we have both. So we've got data analysts, and we've got data scientists. We've got some really good ones. But they're working on different things. The data scientists are more predictive analytics, whereas the analysts are crunching raw numbers, really. We went public earlier this year, and being able to crunch big data for the analysts, doing that in Vertica, was really great, because they could get the answers they needed quickly. Congratulations, by the way, on the IPO. Thank you very much. How's that changed the culture, besides minting a few new millionaires? The reporting piece of it, the compliance, you're now laying it out. A few more edicts. Yeah, last time I didn't wear a coat. Now I've got to wear the coat. It does change the culture a little bit in this way, in that we have to have more data security now. So we've been implementing security across our whole data systems, because not everybody can look at everything.
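The two forks CB describes, batch MapReduce for engineers versus ad hoc SQL for analysts, can be made concrete with the same aggregation written both ways. This is a toy illustration with made-up data, not Etsy's workload: the map/shuffle/reduce steps are what a batch job spells out by hand, while the equivalent ad hoc query is a single GROUP BY.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical fact rows: (region, sale_amount).
sales = [("US", 10), ("UK", 5), ("US", 7), ("DE", 3)]

# Batch fork: the MapReduce-style job an engineer writes out step by step.
mapped = [(region, amount) for region, amount in sales]        # map: emit key/value pairs
shuffled = defaultdict(list)
for region, amount in mapped:                                  # shuffle: group by key
    shuffled[region].append(amount)
reduced = {region: reduce(lambda a, b: a + b, vals)            # reduce: sum per key
           for region, vals in shuffled.items()}

# Ad hoc fork: the same result an analyst expresses in one line of SQL, e.g.
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
print(reduced["US"])  # → 17
```

The comparison is the whole argument: both forks compute the same answer, but only one of them is something an ad hoc analyst will reach for, which is why a SQL engine like Vertica sits alongside the Hadoop cluster.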
So that's part of the problem. I would advise companies to start thinking of that earlier on, because when you have to go back, there's some technical debt you've got to pay off for not doing it earlier. Yeah, you can really get jammed on that. So let me ask you about the impact of mobile on your job. Obviously you guys have been there from mobile first. You guys have a good presence. For the folks out there that know mobile's here and have their toe in the water, or are fully immersed, maybe doubling down on what's working, or having to rearchitect, share your point of view on how mobile's truly impacted things down the stack, and things that you guys have done that you've learned. Or issues you can share. Well, I mean, three, four years ago, mobile was like nothing in terms of its part of our sales and so forth. Now it's 60%. So it's outstripped traditional desktop systems. That's the way the world is working, and that's increasing. You're going to see more and more of it, but there's a problem with the form factor of mobile. So our challenge as a marketplace is making it accessible, and I think we're doing a really good job at that. But we're also analyzing a lot of this stuff. So we get a lot of information in our clickstream data about how mobile is being used. And that kind of brings us to real-time analytics, because we run a lot of A-B tests. Real-time analytics, I think, is going to be a bigger and bigger player as we go forward. We're looking at a product that's brand new on the scene called PipelineDB. It's a fork of Postgres, and it allows you to pump a stream of data into your database and create a view on top of that that you can then join to your regular tables. So if you think of the stream of data coming in, your clickstream data, as your facts, and your stored tables as your dimensions, what's nice for us is we can analyze that data as it comes along and say, oh, this person's on mobile. Oh, where are they from?
We'll join to our user details table and say, oh, okay, they're in this part of the country. And so we can start to really analyze and dig deep on the trends of how people from different regions, both in the U.S. and internationally, are using mobile. Streamlining the user experience, on the fly, basically. That's what we have to do. And so we're tightening the loop on getting answers. That's why I say real-time analytics is going to be a big deal, because we've had this loop where we get data and we have to bring it into Vertica. That's kind of a tedious loop that takes some time. We want quicker answers. So we're looking at all the different streaming analytic solutions, real-time analytics. And they're all starting to emerge right now into the marketplace, but there's no winner. There's high flux. Super early. Super early. And this is the fun part. I think you just hit the nail on the head. We were talking earlier about why it's so much fun to be in this business, because there's innovation going on right in front of everybody. Absolutely. Well, so you heard Robert Youngjohns talk, sort of tongue-in-cheek, about these Hadoop projects spinning up: what's your strategy? And they're like, well, we've got these Hadoop clusters, and in dog years, two years. As the market's evolved, how has the decision-making process around which projects to fund and which vendors to use changed? I would imagine it used to be pretty decentralized and pretty sort of down in the weeds. Is it escalating? Is it becoming much more strategic? It's definitely escalating. I'll tell the story about big data at Etsy. Back a couple of dog years ago or whatever, we had a Postgres BI machine. And Postgres is a great database, but it's a relational database, and it's really excellent at looking up single records. But when you're trying to do aggregation, it starts to fall down, as they all do, okay?
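The PipelineDB-style pattern just described, treating the clickstream as facts and stored tables as dimensions, can be simulated in a few lines. This is an illustrative sketch with hypothetical data, not PipelineDB itself: each arriving event is joined against a small dimension table and rolled into a running aggregate, which is roughly what a continuous view maintains for you.

```python
from collections import Counter

# Hypothetical dimension table: user details keyed by user id.
users = {101: {"region": "US-East"}, 102: {"region": "EU"}}

# Hypothetical clickstream: the facts, arriving one event at a time.
clicks = [
    {"user_id": 101, "device": "mobile"},
    {"user_id": 102, "device": "mobile"},
    {"user_id": 101, "device": "desktop"},
]

mobile_by_region = Counter()
for event in clicks:
    # Join each streaming fact to the dimension table as it arrives,
    # keeping the aggregate continuously up to date.
    if event["device"] == "mobile":
        region = users[event["user_id"]]["region"]
        mobile_by_region[region] += 1

print(mobile_by_region["US-East"])  # → 1
```

The advantage over the batch loop CB mentions is that the answer is current at every moment, rather than waiting for the next load into the warehouse.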
So then we started looking around: what are we going to do to replace this? We need to get fast results for aggregation. So we settled on Vertica, and we've been a Vertica customer now for about three years. And that really has been great. But we brought it in just for certain analysts to use, and all of a sudden it exploded throughout the company. It's kind of the linchpin of our whole data stack at this point, because it's so accessible, because of the good SQL language that it has. People know SQL. And so now it's used by analysts, data scientists, people running A-B tests, and we run certain types of dashboards off of it. So when things start to grow like that, it gains an importance. Like, I've wanted to upgrade Vertica to the latest version for about two months now, and the head of our business analytics team says, don't you dare, because we're crunching Q2 numbers and so forth, and we can't afford to have this thing go down for any amount of time. So that's showing how important it has become. And when things become important, the decision process gets a little harder. The usability drives everything, right? It's like SQL is comfortable, it's accepted, it accelerates adoption. Right. I mean, if you have a really fast car, but it's a stick shift with a clutch and no one knows how to drive it, then it's just going to sit in the garage. So it's got to be powerful and usable. So I want to ask you a question, to riff on something with us. Dave and I always talk about some stuff that we're kind of getting our arms around, and one of the things is this whole omni-channel thing. And one of the things we were talking about on theCUBE a couple of months ago was that beyond A-B testing, there are other letters in the alphabet, to goof on the Google Alphabet news that's been kicked around. You can do a lot of different use cases. So it's not just A-B testing, which is traditional. You're limited by a lot of data.
You can't store all the other scenarios. So take us through the mindset of, soon there will be a multitude of tests: A, B, D, all the way to Z, plenty of letters in the alphabet. Yeah, I think that- How would you attack that problem? How would you protect it? The way we currently do it is we put out an A-B test to a very small segment of our user base out there, and we see how things go with that. Now, we're a rapid deployment shop. We push code about 30 times a day out to our web servers. That allows us to do A-B testing really well, because we can push stuff out and then we can pull it back quickly. There are probably other companies out there that don't have that kind of flexibility in their deployment routines. So it's important if you're going to do A-B testing. That is pure DevOps right there. It is. Well, we open sourced a tool called Deployinator that's used by a lot of companies now. Deployinator? Deployinator, because it's a problem when you have, you know- And then the ops guys call it Terminator. They have their own Counter-Strike tool. Well, when you have this plethora of servers, you know, we've got hundreds of servers that are serving web requests, and you've got to get the code out to them. You have to have a reliable way of doing that so that you know it's all good, and we have this Deployinator process that goes through putting it out to some test servers first, making sure it all runs, it goes through all its unit tests, and then it pushes it finally out to production at the end of that whole process. It's very reliable, and we're able to roll back stuff easily. So you have to have that if you're going to do a lot of A-B testing, or full alphabet testing. So we're moving in that direction, because we can pull things back if need be. But to do the analytics on it, you'd better have a system that can handle that kind of flow of data. And so all that A-B testing stuff, it all goes into Kafka too.
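The rollout approach described, exposing a test to a very small segment of the user base, is commonly implemented with deterministic hash bucketing, so each user sees a stable experience across requests. This is a generic sketch of that technique, not Etsy's actual system; the function and experiment names are hypothetical.

```python
import hashlib

def in_test(user_id: int, experiment: str, percent: int) -> bool:
    """Deterministically place `percent`% of users into an experiment.
    Hashing experiment name plus user id gives independent buckets per test,
    so being in one experiment doesn't correlate with being in another."""
    h = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(h, 16) % 100 < percent

# The same user always gets the same assignment for a given experiment.
print(in_test(7, "new_checkout", 5) == in_test(7, "new_checkout", 5))  # → True
```

Because assignment is a pure function of the user and the experiment, ramping from 1% to 5% to 50% is just a config change, and "pulling it back quickly," as CB puts it, is setting the percentage to zero.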
And that's coming into Vertica so that we can analyze it. You look at the numbers, you look at the results. Yeah, it's interesting. I mean, the thing that we also talk about is that, you know, there's optimization for large scale, high performance web servers, for instance. You can load them up, juice them up, but then it might be rigid from a flexibility standpoint. When you're pushing code 30 times a day, you've got to have the high performance web servers and all the tech underneath, the software, but you need flexibility. How do you balance that? You know, that starts to- The art and science, the kind of- Well, it is art and science together, you know, and it flows up to higher levels. You know, we have a really good operations team at Etsy. Data engineering is my group, but we work closely with them because, you know, hardware is changing all the time, software is changing all the time, and they have to work in concert. So, Colin Mahony, in his keynote, sort of went back to the early days of ERP, you know, systems of record and a lot of highly customized examples. And then it sort of became packaged apps. And I thought he was going to say we're going to see a similar track with analytics, where everything's highly customized and it's going to become more packaged. And he really didn't go there. He said, it's going to be different. It's going to be more composite. It's got to be more flexible. What are you seeing in your organization? Is there a demand, a push for more packaged apps, analytics, you know, bundled into the apps? Or is it really this sort of scenario of composite, composable pieces? Yeah, I think, again, it really depends on the stage of the company. You know, I think if you're small, packaged apps are really good, because they get you along quicker than if you've got to roll your own. But when you start to get bigger, like Etsy is now, sometimes you need the flexibility of being able to roll your own stuff.
So basically what we do is we get the building blocks, like Vertica, Hadoop, Kafka, and so forth. And then we have the know-how and the expertise to quickly write things to make them work. And that gives you differentiation. Yeah, well, presumably, right? And competitive advantage. We know best what's going to work for us. So there are some packaged things that we use. We use Looker, for example, to help our analysts get insight into our data. Tools like that are really useful. But there are some things where we have to roll our own, especially when you're down beneath the surface at an ops level. And you don't necessarily see that change. I mean, tell me, square that with the earlier statements about how simplicity ultimately is going to win. Are we just years away from that, simplicity having that type of functionality, or will there sort of always be a coexistence? Well, when we talk about simplicity, it's about the end user within our organization, you know, the analysts and the data scientists. They want it to be simple and predictable. Making it simple is not simple. That's true, but that's why, under the hood there, you know, data engineering and ops are all working together. There may be some layers of complexity under there, but the whole point is, we'll have the complexity under here, where we understand it, but it's got to be simple when it's facing the user. But I think where some things are going also, as you heard about SQL on Hadoop, that's been a problem now for a number of dog years. And we're excited about what Vertica is doing in there, because we have a lot of data in Hadoop that we're not bringing over into Vertica at this time, and maybe that'll work for us to make that accessible. You said it's been a problem because it hasn't been available. Well, it's like, okay, this is nice data in Hadoop, but it's too much hassle right now to get it over into Vertica, so we'll just let it sit there and we can do some MapReduce jobs on it.
But the SQL on Hadoop stuff is going to democratize that data. It'll let people run queries directly on Hadoop where they hadn't been able to do that. The conventional wisdom is Cloudera's Impala sort of changed the world. I mean, Hadapt was first, but Hadapt failed, and Impala sort of changed that, and then Vertica's always been there. But Hadapt failed because it had some onerous requirements, basically hanging a Solr and a Postgres server off of every node in your Hadoop cluster. So CB, I've got to ask you a final question as we get the hook here, and I'm getting some comments back channeled to me from folks in the industry. Are you bullish or bearish on Kafka, and why? We are bullish. I mean, we're kind of going all in on it. But why? Because it's the winner in all this messaging stuff. There have been a number of Apache-sponsored messaging brokers and systems over the years, and a lot of them have failed. And you talk about technical debt, I'm glad that we waited and didn't go all in on some of these things earlier, because you can be left holding the bag, and that's the worst thing that can happen to you. Kafka's out there, it's getting traction. What we think now is critical mass; it's going to be around. And another really important thing is that when you're adopting technology, you want to make sure that it does have critical mass, because there are people out there who know it. You don't want to buy some XYZ solution, or adopt some XYZ solution, and you're the only person using it, because then any staffing you have to do, you've got to train everybody. This is why PHP kind of won the web language game: there are so many people who understand it and can program it competently that it wins out in the end. Okay, CB, thanks for sharing that. I got that last question in. The folks in the press are going to thank me for that. So thanks for the comments. We're bullish on it as well.
This is theCUBE, bringing you the data and sharing the signal with you. We'll be right back after this short break.