Live from Boston, Massachusetts, it's the CUBE at the HP Vertica Big Data Conference 2014. Brought to you by HP. With your hosts, John Furrier and Dave Vellante. Okay, welcome back everyone. This is the CUBE. This is SiliconANGLE Wikibon's flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE. I'm joined by my co-host Dave Vellante, co-founder of wikibon.org. And our next guest is CB, who's a senior database engineer at Etsy. He goes by CB. Welcome to the CUBE. Thank you. So we love Etsy. Chad Dickerson, we followed his career from Yahoo. I think he was CTO and then he became CEO of Etsy. Right. Great brand. But you're also known in the tech circles for having some really strong DevOps, database stuff. So big data action, respected. Give us the update. What's going on with Etsy, the company? How big? Give us some stats and then let's go and talk about the tech. One of the really smart things that Chad did when he was CTO was he really said, look, we have to market our engineering department. So we created this well-known blog called Code as Craft. Just about every engineer in the dot-com world reads that blog. And we also put out a lot of really great open source products. We have a thing called StatsD that I think just about every company out there uses. It's for monitoring things. So Etsy started in 2005, but it didn't really get going until 2007. We create a market for people who make their own handcrafted goods. And surprisingly, that is a really big market. Last year, we had sales of 1.3 billion. And we've got about 45 million users and about 4.5 million shops. We sell things in, we're active in, 90% of the countries in the world. And we're still growing like crazy. So enablement for this basically is a scale question. So one of the things that's interesting with these companies is that if they hit a critical mass with scale, then good things happen.
So I want you to take us through a little bit of the rise and the scale-up of Etsy, what you guys have done, and then what you guys are doing now. So take us through some of the opportunities you had, challenges, how you solved them, and weave in the big data story. So Etsy started out with one monolithic Postgres database. Everything was on there. All transactions, forums, convos, everything on there. And that lasted about a year. Then we said, OK, we're really stressing this server right now. What are we going to do? So we started to shard it vertically. We moved some of the features like forums out to their own dedicated database, convos to their own dedicated database. And that lasted about a year. Then we said, well, now we've got to start sharding horizontally. So Chad, who knew some guys from Flickr because of their Yahoo connection, these guys really pioneered sharding horizontally. And they came in and brought their ideas and so forth. So we sharded horizontally, this time on MySQL databases. And what we did then from a big data standpoint is we said, OK, now we need to start getting some analytics. Because when you shard horizontally, it's great for looking up an individual record. It's very speedy for a website. Good response. It's terrible and sucks heavily to do aggregation. So we said, we need to do some aggregation. So how are we going to do it? So we got another Postgres database. And we started replicating all our data from our shards and our monolithic Postgres database to this new BI Postgres database. Well, what did we do? We just created a new master, right? Because everything has to go on there. So then I was charged... It's really robbing Peter to pay Paul, right? I mean, that's what's going on, right? But you're scaling, you're growing. Well, so then I was charged with, we've got to replace this thing. Let's find something that works.
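The trade-off CB describes here, where horizontal sharding is fast for a single-record lookup but painful for aggregation, can be illustrated with a minimal sketch. This is not Etsy's actual scheme: the shard count, the modulo routing, and every function name below are hypothetical, and the shards are plain in-memory dictionaries rather than MySQL servers.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration

# Each shard is modelled as a dict: user_id -> list of order amounts.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def shard_for(user_id: int) -> int:
    """Route a user to a shard (simple modulo routing, an assumption)."""
    return user_id % NUM_SHARDS


def record_order(user_id: int, amount: float) -> None:
    """Write path: the order lands on exactly one shard."""
    shards[shard_for(user_id)][user_id].append(amount)


def orders_for_user(user_id: int) -> list:
    """Point lookup: touches exactly one shard, so it stays fast."""
    return list(shards[shard_for(user_id)].get(user_id, []))


def total_sales() -> float:
    """Aggregation: has to visit every shard and merge the partial sums."""
    return sum(sum(amounts) for shard in shards for amounts in shard.values())
```

A lookup such as `orders_for_user(1)` consults one shard, while `total_sales()` walks all of them. With real databases that second pattern means a query per shard plus a merge step, which is the pain point that pushed the team toward a separate analytics store.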
So I went out and looked at all the analytic databases, basically the columnar databases, because we wanted to do really good aggregation, and wound up with Vertica for a number of reasons. A lot of the folks that are watching, you know, we have some syndication going on from sites that aren't in the weeds in the technology. These big sites like Etsy, these huge sites, have under-the-covers technology challenges. One of them is big data. So you get to a point where you're full as a tick and then you explode. And then you've got to re-architect. When did you get to that point when you said, OK, now we have to go to phase three? So Postgres pushes the envelope, bursting at the seams. Now it's sharding horizontally, stopgap measures. When did you get to a point where you got some smooth sailing? OK, so we made each of those transitions because we couldn't live any longer with what was. So I think that's probably like every company. Exactly. A lot of these things are just hacks. And the best hacks are the ones that live the longest time. So this BI server was kind of a hack. The way we replicated to it was a hack. So once we got Vertica in-house, it's amazing what a difference it's made. Because we were able to take queries that our analysts were running on this BI machine, some of which took four days, and all of a sudden they're running in minutes. And the biggest surprise is we thought, oh, just our analysts will use this. But we've had to up our license, add more nodes, because everybody's jumping in on the Vertica action at Etsy. Because it's so fast, and we use it to do so many different things, things that we didn't anticipate when we first got it. But we power just about all our internal dashboards on it. We use it for really getting a tight loop on our A/B testing, and we run financial reports on it. So on any given day, there's probably 100 people at Etsy hitting it and running queries. And we anticipated maybe there'd be 10. I saw Chad on this thing.
It was CNBC or some news channel, all dressed up wearing a tie. I'm like, that doesn't really... That's why I don't wear my tie anymore on theCUBE. This wasn't fitting. And Chad, if you're watching, you know, you don't need to wear the tie on CNBC. Right, that's not the Chad we know. He's in pretty casual attire with a highball in his hand. Chilling out there. So let's talk about, we had the gaming guys on. Dave and I were just talking about it, we've had Peak Games. We talked about Riot Games. A lot of these guys, all big data, large scale, whether you're e-commerce or online gaming, the real-time pressure is there. You've got a lot of scale, a lot of traffic. Oh yeah, yeah. How are you guys dealing with the big data side? With Hadoop. It's a Hadoop world, right? So you've got the need to store stuff. The "I'll get to it later" kind of mentality, which you want to do. And also the production side of it. So how do you blend getting the data captured with new emerging stuff like Hadoop and dealing with this production, not disrupting things? How do you handle that? Well, a little bit on Hadoop. We were, and we still, use Hadoop to improve our search algorithms. Every night, all the search terms that were used on Etsy are downloaded and put into Hadoop. And we run a bunch of stuff, MapReduce jobs, on that. And then I think it goes into Mathematica and they get some rules that are generated out of that to help our search. So we were doing that in the Amazon cloud for a while, but the bill was like 80 grand a month or something and we finally said we've got to get rid of that. And so we just bought a 200-node Etsy-doop cluster, right? But here's the problem with Hadoop. It's hard. Give me your best Hadoop engineer and I'll give them about a one in 10 shot of getting a MapReduce right first time. It's a very iterative process. And so we... Specialized, iterative, and it's also complex. Yeah, and it's really geared towards batch jobs.
But we have a lot of people running ad hoc queries and we actually use Vertica as a front end to Hadoop. Anytime someone says, oh, I want to run a MapReduce job, we'll say, can we do this in Vertica instead? Because a lot of the data that's in Hadoop then gets funneled into Vertica where it's now accessible with a whole bunch of really great analytic functions. And so that's how we roll. I mean, we're actually doing less Hadoop now because we're able to get that data into Vertica and it's kind of taking the place of that. So a lot of people say, okay, I'll use Hadoop to do my filtering and then I'll jam all the nuggets into some other database. That's kind of what we're doing. It is, okay. Yeah, we use, I'm not trying to downgrade Hadoop here at all, it's got its place. But I'm saying, you know, interface is everything. And you know, we want to get answers and get them quickly. And the best way for us that we found is, hey, if we can get that data into Vertica, then our people know how to write SQL and know how to do that really well. And that's what we want to do. And the economics work for you. Oh, absolutely. And how much data are we talking about here? 30 terabytes we have in Vertica. Well, I mean, OK, what's interesting is you said interface is everything. I like that comment because we're talking about not only the throughput of the actual petabytes, but we're talking about actual human capital, all the manual work that gets automated away. So just from a workflow standpoint, I mean, think about, you know, the hassles. We've seen this in social media in our world on the big data side. It's like, to do a certain workflow, value chain, it gets automated away, people get to free up their time. So a lot of people feel like, you know, big push, make Hadoop real time, bring SQL to Hadoop, all this stuff. A lot of people say, well, Hadoop was never meant to be real time. Right.
Are you sort of in that camp? I mean, Stefan Groschupf of Datameer said, oh, it's all nonsense, you know. Yeah. What's nonsense, Hadoop? Hadoop real time. Not meant to be real time. It's batch. I'm in that camp. There are certain other products and technologies out there that want to put, you know, SQL closer onto Hadoop. And what they want to do is hang a lot of software on each of the nodes. And that's a maintenance kind of nightmare that we just won't go for. You don't want that from Vertica. What do you want from Vertica? Well, we just want what it's giving us, which is accessibility and speed. And what's really cool about it is when we got this... He's a Ferrari driver. We're talking about this Ferrari of big data. Well, you know, there's this expression, the total cost of ownership. And usually that's like, OK, this software is going to cost us this and so forth. And maybe we've got to hire some people and that's going to cost us this. We didn't have to hire any people because we found that our DBAs could administer this stuff just fine. It actually shares a lot of DNA with Postgres, which we were very familiar with. But what was really surprising is that we didn't have to change any of our queries. We had a whole investment of many, many hours put into hundreds and hundreds of reporting queries that were running on that Postgres BI server. And we were able to bring those over and run them unchanged on Vertica, but just have a real kick up in speed. And that is a cost that we didn't have to endure. If we had to go and rewrite all those queries, that would have been a big expense, right? So we didn't have to do that. Because that's going to be a big part of your migration cost if you have to rewrite everything. That's going to be the biggest part of your TCO. Absolutely. It was nice and easy. If you move cross-country, you can spend 100 grand on the moving company, and we didn't have to pay that 100K. It just worked beautifully. So what do you want out of the roadmap?
If you're talking to Vertica, what do you push them to do? You said they're delivering what you want, but you've run out of gas three or four times. Yeah, yeah. What do you want them to do next? There have been some really good talks today about the future and near future of Vertica. And one of the things they're adding is sort of their version of materialized views. So these are projections which are pre-aggregated. So what that means is that you can do things like roll-ups and have those instantaneously available. For example, we have our analysts every day creating tables where they're creating their own roll-ups like, oh, for this state, here's how much activity there was. Now we can write projections that do that automatically and so they don't have to do that anymore. They can now move on to the next thing. So it's a fast roll-up or it's a roll-up later? It's actually a full roll-up that's actually occurring. It's a full roll-up and you pay a price in that when data is loaded, it's got to do that calculation, but then you save on the way out when you're doing a select query. Think of this projection as kind of like an index where you're indexing on some type of expression or function, and that's really nice. What's happening at Etsy on the business side? As you guys map out your strategy and your growth plan, what's the direction and how is that driving the data strategy? Well, one of the big things we want to do is really expand in international areas, and we really like being able to deep dive into the data and look at how people are buying things in these foreign countries, because there is a lot of cultural difference across the world, which is a great thing, and people do buy things differently. I know in Germany, for example, they do not buy stuff via credit card. They have their own way of doing things. And then also there are certain areas where certain items are more popular and we're able to discover that.
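The pre-aggregated projection idea from earlier in this exchange, where you pay for the roll-up when data is loaded and save when you select, can be sketched in a few lines. This is only an illustration of the load-time versus query-time trade-off in plain Python; it is not Vertica's implementation or syntax, and the class and method names are hypothetical.

```python
from collections import defaultdict


class RollupTable:
    """Toy table that maintains a per-state roll-up as rows are loaded."""

    def __init__(self):
        self.rows = []                        # raw (state, amount) rows
        self.by_state = defaultdict(float)    # pre-aggregated totals

    def load(self, state: str, amount: float) -> None:
        # Load-time cost: every insert also updates the roll-up.
        self.rows.append((state, amount))
        self.by_state[state] += amount

    def total_for_state(self, state: str) -> float:
        # Query-time saving: one dictionary lookup, no scan.
        return self.by_state.get(state, 0.0)

    def total_for_state_scan(self, state: str) -> float:
        # What the query costs without the roll-up: a full pass over the rows.
        return sum(amount for s, amount in self.rows if s == state)
```

Both query methods return the same totals. The difference is that `total_for_state` reads a value computed during `load`, which mirrors the analysts no longer having to build their own roll-up tables by hand.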
Yeah, we had the gentleman on from Peak Games and he was saying that the culture overseas, in let's say the Middle East, is totally different in terms of who adopts games than the US, John. I mean, it's just... So one question I want to add, I know we're short on time here, on the business model: obviously you guys are on the e-commerce side. With BuzzFeed getting the $50 million in financing, just announced in your neck of the woods, with an $800 million valuation, this notion of native advertising is coming in. Mainly they're in the social channels, so we all know what that means. Big data is a big part of social, we see that. So do you guys look at social as a distribution channel and look at the data piece of it? Not to advertise so much, but you have an acquisition of audience, trying to get traffic, get conversion. The social graph aspect's interesting, trusted consumption from friends is interesting. You guys have a big data play there? Well, you know, you're seeing more and more commerce happening in some of these social media things, like people sell stuff on Instagram, for example. And so they're making money that way. So we're looking at all these channels and saying, should we be there? It's the new affiliate channel. I think it's exciting. So I mean, I've been bringing it up because what Dave and I are talking about in depth is that all new models are coming to the table because of big data. Well, we use big data to analyze, for example, Facebook. We've got a lot of stuff going on there, recommendations by friends, and we'd like to look and see how effective that is. And big data lets us look at it. Well, we had a guest on earlier, we were talking about it with three other guests, this whole retargeting thing looks good on paper, certainly the numbers jump off the spreadsheet, but the user experience certainly is not there. It's a hard problem to tackle, this whole retargeting thing.
How do you know when someone wants to come in just because you're using cookies? Well, like I said, the interface is everything. And you always have to make... in the past, I was a product manager, and I always wanted to make things work for the user, right? I love it. That's exactly how we feel. Interface is everything, user experience, user expectation. The preferred future is upon us. Hopefully big data will make it better, not just to sell ads, but really make a great product experience. So totally, totally awesome. Look, when I first started out as a C programmer, we used to have to compile things overnight. And so that was a really long loop. If you had one syntax error, you know, you came in and cried. Had pagers back then, right? Yeah, yeah. Had a pager that would go off. Right, exactly. So, you know. Turnaround time, 45 minutes for the punch cards. Time is money, and when you can tighten these iterative loops, the more you can do that, the more money you're gonna make and the happier you are. CB, we're pushing the envelope here, we're getting the hook big time, I had to push him down twice. Loved having you on. Great, great company you guys have. Love the technology. Big fan of Chad and the team. Congratulations. Thanks for sharing your opinion on theCUBE. We'll be right back. Thanks a lot. I'm John Furrier with Dave Vellante. This is theCUBE live in Boston for HP Big Data Conference. Be right back.