Okay, we're back inside theCUBE with a special two-person interview: Pradeep Sankaranti (Sankaranti? Sankaranti) from eBay and Thomas Pan from eBay. You guys are in the search index group, along with all kinds of other back-end capabilities for eBay. eBay is obviously a big internet company, a Web 1.0 company, still around, I hear, doing some good work, still doing the auctions and doing a lot more e-commerce. So share with us, Pradeep, what's going on at eBay with HBase? This is the HBase conference. It's very technical, with a lot of alpha geeks here pushing the envelope around the data store and all the classic storage challenges. However, in a real-time application environment, you need low latency and you also need batch, you need real-time analytics, all that good stuff. So share with us what's going on at eBay relative to HBase.

At eBay, we have been rewriting our search engine, in a project called Cassini. We've been rewriting the whole search engine stack, which includes the query nodes, the query processing, the indexing platform, and the entire data acquisition part as well. For this project, we are using Hadoop and HBase. HBase is the repository for our indexing data, our inventory data, and we have MapReduce applications that read data out of HBase and build indexes. We have two use cases. One is an offline, batch use case, where we build hourly indexes. The other is the real-time use case, where we have frequent updates coming into the system: things like bids and BINs on auction items, or titles changing, or any inventory data changing. Those go through the real-time pipeline. So we have to support very low latency, like you mentioned, and a high volume of data. That's what we have been working on. So we've been using HBase. We have about one and a half petabytes of inventory data.
And so this is what we are currently working on with respect to HBase.

Thomas, you're the principal engineer over there. So just tell us what you do. What is your role at eBay?

My role is basically to work with the team to come up with the whole pipeline architecture and then to carry it out, to implement the whole pipeline. That's basically my role.

So what are the core challenges that you guys have overcome? Obviously in Hadoop there's a lot of complexity. There are a lot of challenges, but, you know, there are dividends there. What have you worked on that you can share with the folks out there that's been challenging around some of the complexities?

I think the first challenge is that the learning curve is very steep. When you go into the Hadoop and HBase world, suddenly, instead of having one simple configuration file, you have multiple configuration files. Each file contains hundreds of configuration parameters, and you have to understand most of them to a certain extent. Then you also have the monitoring system capturing so many metrics, and you have to figure out which ones are really important. On top of that, there's rapid development happening in the community. So unless you can get hold of the programmer who introduced the feature, sometimes when you bump into a bug, it's very hard to get support. So we just went through this whole bumpy road. And I think the whole team is getting there, maybe to the level of Cloudera's support.

So Pradeep, share with us. Basically what he's saying is that it's a steep learning curve, but it pays out because you actually have a different value proposition. Can you share with us, one, some of the value you've extracted out of that investment of labor? And then, what do you guys do to make it easier to hire new people to actually write code?
Because if it's a steep learning curve, you know, more and more people want in; that's why this is sold out. A lot of people want to learn more, and the community's growing. So, the value you're extracting, and then some of the things that you're working on to continue to overcome those challenges.

So like Thomas mentioned, when we started with HBase, we went through some initial growing pains. The stability of the product was also not that great. But over time, the stability has come a long way. We built a lot of knowledge around how to operate and manage the system from an operational perspective. So those are some of the learnings. There was also a lot of help from the community on how to manage the system, how to maintain our clusters, and what kinds of things you want to monitor and alert on. So those are a lot of the things we learned. The community has been a big help to us. And since we were one of the first groups at eBay to start using HBase, we kind of built up this knowledge base. There are a lot of other teams at eBay starting to adopt HBase, and they reach out to us.

So my question is, one: Cloudera has been a pioneer in the space since the founding of commercial Hadoop, for lack of a better term. I don't want to diss the community, because it's been around for a while with Doug Cutting, etc. But Cloudera has been making some good improvements. What have you guys done with Cloudera? And specifically on HBase, why is HBase better than, say, Mongo or some other approaches out there? It's not a political question, it's a legitimate question. Is there a use case where HBase far exceeds those other guys? So talk about that. Do you work with Cloudera?
So we do work with Cloudera. Like I said, when we started last year, we had a lot of issues with the cluster, with getting the product to function as we required. We worked together on stabilizing it: we identified a lot of issues and we asked Cloudera to fix them. We had a close interaction with the Cloudera team. We interacted with a couple of Cloudera members, like Todd, and they helped us out a lot with some of our requirements. As for other approaches like Mongo, we didn't actually investigate Mongo or Cassandra. The previous version of our search indexing platform uses Oracle, and that is still serving our live site. But because of the rate at which we wanted to index, and also the volumes we wanted to index, HBase was the way to go. So we've been sticking with HBase since then.

Okay, Thomas, I've got to ask you the question around challenges. We were talking with Christophe Bisciglia, and he was saying there's a long list of things that entrepreneurs can work on out there. The tools are getting there, but it's still growing, it's early, though there are benefits, as we've heard from a lot of folks. What are the core challenges today with HBase that need to be addressed? Not in a negative way, just in terms of the evolution of HBase, because obviously there are a lot of successes with HBase. But still, like you said, it's a steep learning curve. There aren't a lot of people who can program with HBase. APIs aren't standardized. There's a whole set of things. So talk about the challenges and then tell me your wish list. If you had a magic wand: "I want these five features."

Actually, we want a lot of features.
We still believe the main gap in HBase shows up when you compare it to Oracle or to MySQL: those have been out live on many, many sites for so many years. In case people don't know, when eBay began, we picked Oracle as our database solution, and over all these years we have helped Oracle improve the quality of their database. So first, we are pretty confident in the HBase solution. After more than a year of playing with it, we have become more and more confident, and we're more familiar with the code base. But we still think there is a huge gap at the production level. Say you have an outage, some hardware failure or network failure that makes the whole cluster unusable for a period of time. How do you quickly patch the data and bring it online again? Even though you have multiple colos as a DR scenario, you will still have the problem of quickly restoring the bad colo back online. Since we haven't really put the whole product in production yet, I think there will be a lot of challenges down the road. Another part is that the community is still adding more features to the code base, and when you add a new feature, it takes time to mature. Officially releasing a version doesn't mean it's bug free. So there will be some challenges, and the code needs some time to bake. We're still learning that, working with Cloudera, and I think we're getting better and better.

Okay, Pradeep, do you want to add some comments to that?

Yeah, a couple of comments: high availability and reliability, right? This is a 24-by-7 operation. We want our clusters to be up all the time with no downtime. Our tolerance for downtime is very small. There are a lot of end users affected, a lot of people's livelihoods affected, if there's any downtime.
So that's a critical part for us. The features I would request are things like high availability and more reliability. Those are the top requests.

Okay, so here's a question for you, a philosophical question. HBase and Hadoop, okay, I buy that: they're coupled together, at least from a relative, you know, brother, sister, cousin, uncle, family perspective. Mongo and other approaches are different, and people are buying into that. Also, there are some specific use cases around key-value stores and whatnot. But now you have batch, right? Some people love the batch. They want to do real batch on commodity hardware. You brought up the issue of hardware failure and cluster failure. We heard Karthik from Facebook say, you know, here's how they deal with replication, and it's not always that clean. They don't authorize anyone to do any app development yet. It's kind of command and control, which is cool. I get that. It's just different. So you've got batch, very well received. Now you've got real time, and everyone wants analytics. So throw that into the mix with the things that you said. Your wish list looks good on paper, but just the batch side is going to be hard. How do you see real time affected, since you've got both batch and real time?

So, like I said earlier, the primary use cases are batch and real time. From the batch point of view, HBase and Hadoop are providing us what we need. On the real-time side, we're currently working on the real-time application, and we are using Hadoop, I think, for getting our real-time indexes as well. So there may be some more tweaks that need to be made on the application side and the Hadoop side as well to get our real-time flow working. We are still working on that, and we are pretty close. We are able to solve most of the...

Are you using Mahout at all?

Sorry?

Are you using Mahout at all?
No, we are not using Mahout. So we don't...

That's still on the batch.

Yeah. For our real time, we still go back to HBase, get the data, and index the real-time streams as well.

Okay, we're here with Pradeep and Thomas. Final question for each of you, then we'll bring in our next guest, the new VP of Engineering at Cloudera, which is the big announcement for Cloudera this week. Final question, and Pradeep, you might be able to share this, because I don't know if Thomas was there: the initial Hadoop World in New York City was kind of a groundbreaking event, because it was the first Hadoop World. No one really knew what to expect. We had theCUBE there live and it was fun. It was a packed house, a lot of tech geeks and some business people, and it was really the first kind of event where you had a commingling of geeks and business people. So talk about what's changed in the community and in the marketplace in just the two and a half short years of evolution, both technically and in any observations out in the marketplace.

I see there is more inclination towards open source right now. Like you said, there's a mix of geeks and business groups, right? Two and a half years back, there was still a lot of resistance to open source. Even at eBay, two and a half years back, there would have been a lot of resistance to using an open source product for production-oriented applications. But now it's come a long way. It's become the norm to use open source in most of the big companies.

Thomas, do you want to add to that with what you've seen change over the past few years?

The biggest change I'm seeing is basically momentum. Maybe two years ago, HBase was far from being productionized. And now there are so many companies using it.
And I think the major breakthrough actually happened last year, when, at the Hadoop Summit, Facebook announced that their real-time platform is basically based on HBase. So we're pretty confident and comfortable with the progress that the open source community is making.

Okay, Thomas and Pradeep from eBay. Great session here on theCUBE. Thanks for sharing your knowledge with the crowd. This is theCUBE, our flagship telecast. We go out to the events, extract the signal from the noise, and share that knowledge with everyone out there. This is a great example of a company that really is ops dev, not dev ops. eBay can't go down. There's a lot of money at stake: if it goes down, people lose money, and that's not a good thing. It's great to hear open source is getting a real strong foothold in a production environment. I think that's a great trend. And thanks for sharing that. We'll be right back with our next guest, the new VP of Engineering at Cloudera, right after this short break.