Okay, we're back live here in Silicon Valley, in the heart of big data, at Santa Clara, California. This is SiliconANGLE's exclusive coverage of O'Reilly Media's Strata Conference, day two of our three-day wall-to-wall broadcast coverage. I'm John Furrier with Dave Vellante of wikibon.org. This is theCUBE, and we are here with Quantcast. Jim Kelly, who's the VP of R&D at Quantcast. Jim, welcome to theCUBE. Thank you very much. So you've got an interesting angle on all this. Quantcast obviously collects a lot of data. You guys are big data practitioners. For those who don't know, Quantcast is a service that... We use. We use. We're Quantified. Check it out. Glad to hear that. Silicon Angle Network. We think your numbers are a little low on us, actually. You know, we get that a lot. We prefer Google Analytics. We actually think Alexa is a better audience. Google Analytics numbers are actually higher. I think the Alexa Toolbar market is not doing too well. Maybe we could talk offline to help us out. But it's all good. We're one of the few sites in our business that are Quantified. We're proud to be. And we really appreciate that service. It's fantastic. But you guys have some interesting things going on. You're obviously big data practitioners. Why don't you tell us your angle on things, and then we'll get into the Quantcast File System. We are indeed big data practitioners. We've been doing big data since 2006. We were big data before big data was cool. And we ended up innovating a lot of our own technologies to handle the volumes we were getting. One of them was our file system, the Quantcast File System, which we built to deliver better cost efficiency, especially at large scale. We've developed it over five or six years internally. We run our business on it. And in September, we released it to open source so that other organizations could use it as well. Okay, so tell us, why does the world need an alternative to HDFS?
What are the problems there that you're trying to address, that you obviously addressed internally? Well, the biggest problem we were trying to address is just how much it costs to process big data. Hardware costs a lot. And when you're doing big data, data sets tend to grow, and the costs tend to grow as well. It's not just disk drives; it's servers and racks, and operating costs around power and cooling, and real estate for renting space in co-location facilities, and operating staff to keep it all going. It costs a lot. Once you're operating at scale, you're going to be paying five, six, seven figures per month for your computing power. So what we built was a more efficient file system that makes better use of space and effectively doubles the storage capacity of a Hadoop cluster versus stock HDFS. So I look at this as almost like a little mini AWS, right? You have all this cool tech internally and say, hey, we can make a business out of this. Let's put it out externally and help the community. Well, we're not trying to make a business out of it. You know, Quantcast is in the audience measurement and advertising business. We don't sell file systems, and we don't sell support around file systems, but this has added a lot of value to us. We've used so much open source software internally and benefited from it so much, it's really good to be able to give back. So the value that you get out of doing this is you get more contributors, more people actually innovating around the environment, right? Because you can't do it all yourself. Is that right? Yeah, well, it means a lot to developers. If you're a developer, it's a good thing to be contributing to an open source project because it gives you a chance to make a bigger impact. And so it's popular with our team.
And for the company, file systems especially are really critical pieces of infrastructure that benefit from the scrutiny that the open source model provides. So we get some benefit that way as well. So what are your objectives specifically, and how will you measure the success of this initiative? Well, success for us is getting some high-quality collaborators and extending the product together. Okay, so when you say getting some, are we talking dozens? Do you want hundreds, thousands? What would you consider a successful endeavor? No, I don't think we have such grand ambitions. As I say, we're primarily audience measurement and advertising, and that's going to be our focus, but I think this is an opportunity for other organizations to take it and run with it. So paint a picture if you could. I don't know how deep you can go, but really help us understand the internal environment at Quantcast, to the extent you can. You talked about the infrastructure; servers are expensive, storage is expensive. Paint a picture for us. What does it look like under the covers, to the extent that you can share with us? Well, it's pretty big. We get over 50 terabytes of data in the door per day. We process over 20 petabytes on the average day, and we do it on about 1,000 machines, reasonably modest hardware. So all the components of our system get a fair amount of exercise. We hammer on them pretty hard. So when you say modest hardware, would you consider yourselves sort of hyperscale class? Are you buying from ODMs, or do you buy traditional three-letter or two-letter server vendor hardware? Well, we buy commodity hardware. We spec it out ourselves. We've always been really cost-conscious. Back when we got started, we were a small self-funded startup with decent-sized data and a very modest budget.
And our budget has grown a little since then, which is a good thing, but we're still very cost-conscious, really conscious of how much we're paying for hardware and how efficient the software is. So what does that mean? I'm not even sure what commodity hardware is anymore. Is that Dell commodity hardware, or is it Quanta commodity hardware? Well, sure, more or less. I mean, as opposed to an appliance. If you want to buy a rack of Netezza, for example, it's going to set you back a million dollars. And if you want the equivalent of a rack's worth of hardware at Amazon, that'll cost you a million dollars per year. It adds up really fast. So Amazon's the more expensive one in that equation, folks, as we've been saying. If you're buying off the rack, so to speak, a rack's worth of hardware will cost more like $200,000, and tens of thousands of dollars per year to operate. Okay, so it's not Exadata, it's not the purpose-built appliance. You talked about 2x the efficiency of, for instance, HDFS. Can you talk a little bit more about the tech behind that and how you achieve it? Absolutely. The number one challenge in designing a distributed file system is fault tolerance. When you've got 100 machines or 1,000 machines, you can't expect them all to be up and running at any given time, so your software needs to be able to tolerate bits of your data going missing. The way HDFS does that is it makes three copies of all the data, which works fine, but it's expensive. It means you need three times the disk space and three times the servers and three times the power and cooling and so forth. The tried and true brute-force approach. Yes. But not the most efficient. No. So QFS uses Reed-Solomon encoding, or erasure coding, which has been around for decades. It's in CDs and DVDs; they used it in the transmission protocol for the Mars rover, among other things. Heavy math.
It's heavy math, and it's got a long lineage, but it hasn't seen a ton of use in distributed file systems. We put it in QFS because it delivers a big space savings. Whereas HDFS takes triple the storage, QFS takes only one and a half times. So relative to HDFS, it saves half your disk space. Okay, so you're creating slices and distributing data, and then you can lose X number and recreate them. Right, you're creating data slices and parity slices. By default we use six data slices and three parity slices, and we write those to nine separate places. If any six remain readable, then we can reconstruct the original data. And that's actually a little better fault tolerance than you get from HDFS. If you're making three copies of your data, you can afford to lose two, but you'd better not lose three. We can afford to lose three. Do you encrypt the slices? No, we don't encrypt the slices. Do you see that as a need that the open source community might pick up, for example? Is that sort of on the to-do list, or not necessarily a high priority? Well, I think the security features that are probably higher on the to-do list are more around authentication, not necessarily encrypting the data per se. Although there are plenty of other layers where it might make sense to encrypt the data. Right, okay, so we talked about the to-do list. Where would you encourage the community to pick up the innovation, and where are you guys spending your time? It's a good question. Hadoop serves a very broad set of customers with a very broad set of use cases, and we've had a pretty specific one, around the scale we're operating at and our desire for cost efficiency. I think the opportunity is for other people who have use cases that don't line up very well with ours to go and build those pieces themselves.
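The storage and fault-tolerance arithmetic Kelly walks through can be sketched in a few lines of Python. The numbers come straight from the interview: HDFS's default 3x replication versus QFS's Reed-Solomon layout of six data slices plus three parity slices; the function names are illustrative, not part of either system's API.

```python
# Compare raw-storage overhead and fault tolerance of two durability schemes:
# full replication (HDFS default) vs. Reed-Solomon erasure coding (QFS default).

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte when keeping N full copies."""
    return float(copies)

def erasure_overhead(data_slices: int, parity_slices: int) -> float:
    """Raw bytes stored per logical byte with (data + parity) striping."""
    return (data_slices + parity_slices) / data_slices

hdfs = replication_overhead(3)   # 3 full copies -> 3.0x raw storage
qfs = erasure_overhead(6, 3)     # 6 data + 3 parity -> 9/6 = 1.5x raw storage

# With 3 copies you survive losing 2 of them; with RS(6,3) any 6 of the
# 9 slices suffice, so you survive losing any 3.
print(f"HDFS 3x replication: {hdfs:.1f}x raw storage, tolerates 2 lost copies")
print(f"QFS RS(6,3) striping: {qfs:.1f}x raw storage, tolerates 3 lost slices")
print(f"Disk space saved vs HDFS: {1 - qfs / hdfs:.0%}")
```

This is where the "saves half your disk space" figure comes from: 1.5x overhead is exactly half of 3x, while the parity scheme simultaneously tolerates one more loss than triple replication.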
I know other organizations care more about federation, or they care more about high availability, and those haven't been high priorities for us, but they may be higher for them. So we're trying to give them the tools to build those if they want. So since you've announced the give-back to the open source community, what kind of uptake have you seen? Are people actually experimenting with it, deploying it? Yeah, we're seeing a fair amount of experimentation, and we're seeing a little deployment. I think the challenge with an open source project is that, unlike licensed software where someone's paying you, you put it out there and you don't necessarily know who's running it. And this thing works pretty well, so people don't have to come back to us for a lot of help. I think they're kind of self-sufficient with it. But yeah, we do know of other places that are using it, and they're happy with it, and that's great news. Excellent. All right, Jim, well, listen, we really appreciate you coming by and sharing the Quantcast story, and congratulations on the give-back to the community. Good to see you, nice to meet you. Okay, well, I didn't get a word in edgewise, Dave. Thanks for your great interview. Sorry, John, sorry. Love Quantcast, you guys doing real-time data. Thanks for coming on theCUBE. We'll be right back with our next segment. This is theCUBE at the Strata Conference, winding down day two, getting ready to do a wrap-up here. Wall-to-wall coverage, three days of in-depth coverage. We'll be right back with our next guest after this short break.