Hi Krishna. So I don't have a PPT or anything, so maybe... Okay, so just to give you a little bit about my background: the last place I worked with Hadoop extensively was Hoolivers. I started out working in the US before moving to India. But that was 2010, and technology has moved so fast since then that you should take anything I say with a handful of salt. But I'd like to share some of our experiences around why we moved to Hadoop, as well as some of our learnings from it. And I guess it's going to be a lot of Q&A anyway, so you can dig in and ask me deeper questions at any point in time. I was more on the product side, but there's only so much you can do without being technical, so I did a lot on the engineering side as well. I'd be happy to answer on either side, product or engineering. So maybe I'll start off with why we moved to Hadoop and some of our learnings, and then if you have any specific questions you can ask me later.

So Hoolivers is a consumer internet company. It used to make browser plugins; now it does many things. You install the plugin in your browser and it gives you a richer view of Google searches and so on. We had a lot of downloads coming in from partners like Mozilla Add-ons. We were listed in a number of add-on directories, and on average we would get 100,000 to 200,000 downloads every single day. That's 100,000 to 200,000 clients being installed, and each one of these clients would send back a lot of data. As in most cases, we started out with MySQL as the solution. And honestly, MySQL is usually the right solution 99.99999% of the time, and in that remaining fraction it's usually you who are doing something wrong rather than MySQL, right? But there were some very specific reasons why we had to choose Hadoop. And I somehow don't like the idea of tying big data one-to-one with Hadoop, because there are a lot of use cases where Hadoop becomes useful even if you don't have big data. I want to talk a little bit about that towards the end.

So why did we move to Hadoop? MySQL is awesome for web analytics and, really, all kinds of analytics. But we had clients across multiple operating systems and multiple browsers. Since we shipped a client, we had to have a separate client for Chrome, for Safari, for Firefox, for IE, plus the browser add-ons for each of those. And then we had Android applications, iPhone applications, Flash clients, the website, embeddable and non-embeddable (that is, installable) versions, and so on. And we had engineering teams that would move and iterate on these products at a very rapid rate. So what does that mean? First of all, we were getting lots of downloads, which means a lot of data coming in. But MySQL can easily handle that, right? The problem is that on the product team you want to start doing analysis on what is working, what's not working, what part of your usage is being driven by some specific feature in a specific product. And those are not easy to express as SQL statements. If you want to find out how many impressions are being served, a single SUM query in SQL will easily tell you that. MySQL is awesome at telling you that.
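To make that concrete, here is a minimal sketch of the kind of aggregate MySQL handles trivially, done over JDBC. The table and column names (impressions, product, served_on, impression_count) and the connection details are hypothetical, not from the talk:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImpressionSum {
    public static void main(String[] args) throws Exception {
        // Hypothetical table: impressions(product VARCHAR, served_on DATE, impression_count INT)
        String sql = "SELECT product, SUM(impression_count) AS total "
                   + "FROM impressions WHERE served_on = CURRENT_DATE() "
                   + "GROUP BY product";
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/analytics", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                // One aggregate row per product: exactly what MySQL is good at.
                System.out.printf("%s: %d%n",
                        rs.getString("product"), rs.getLong("total"));
            }
        }
    }
}
```

The retention and feature-cohort questions that come next are the ones that stop fitting this mold.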
But if you want to find out, okay, of the 1 million users that used my product today, how many of those have been using it since some particular month, and which of those are using it because they use this one specific feature? Writing a SQL query for that is incredibly hard. And the second thing is that schema changes were happening very, very often. We had a suite of 7 or 8 products, and each one of those clients has its own differences, because these are C++ clients running on the desktop, not JavaScript running in a browser. Given all the schema changes, we very quickly realized that the typical RDBMS setup of tables and so on wouldn't really work that well, especially because we wanted to run a lot of ad hoc queries later on, and some queries were being run on a near-real-time basis, like time series data. So we moved to protocol buffers (there's a small sketch of what such a record looks like at the end of this section). And once you move to protocol buffers, because it helps you with versioning of the data structures you use to store your data, MySQL obviously became a lesser and lesser fit.

So one key takeaway I'd like to share: look at the other use cases where you see big data solutions like Hadoop being applied. On that slide earlier, there were really only two rows out of 15 or so that you would even think of putting in a database. Banking transaction data? Sure, you can find some way to fit that into a table. But what do you do when you're analyzing X-ray scans? What do you do when you're analyzing audio files, or videos, or textual documents, DOCs, PDFs and so on? These are not things you traditionally dump into a database. So that's actually one great use case where Hadoop comes in: it provides a more general framework for working with those kinds of formats, even if you don't have petabytes of data. And so that was one of the reasons we moved to Hadoop.

The other reason for moving to Hadoop was that it was the only solution at the time that was reasonably stable. I think we started experimenting with it in early 2009 and then moved to it six months later, once we were fairly confident it would work. It was sort of the only stable solution that would enable us to do the kind of ad hoc analysis we wanted, as well as run a daily job that would dump the data we needed, time series and all of that, into MySQL. So it's usually a combination of SQL and Hadoop; it's rarely an either-or situation. You use SQL for storing the data that you want to graph, and you use Hadoop for ad hoc analysis as well as for generating the data that gets fed into those tables.

So what were some of our lessons moving to Hadoop? It's honestly a lot more pain than you realize when you start off. It's a lot better now, because Hadoop itself is more stable and so on. But what were the specific problems? The specific problems are that even though you have big data and big data solutions, the little details and the little problems don't go away.
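Here is the protocol-buffers sketch mentioned above: a minimal illustration of why field-level versioning made schema changes cheap. The ClientEvent message, its fields, and the protoc-generated Java class are hypothetical, not from the talk:

```java
// Hypothetical proto2-era schema (client_event.proto), of the kind that
// would be compiled with protoc into a ClientEvent Java class:
//
//   message ClientEvent {
//     optional string user_id    = 1;
//     optional string browser    = 2;
//     optional int64  timestamp  = 3;
//     optional string feature_id = 4;  // added later; old readers just skip it
//   }

import java.io.FileOutputStream;

public class EventWriter {
    public static void main(String[] args) throws Exception {
        // ClientEvent is the hypothetical generated class described above.
        ClientEvent event = ClientEvent.newBuilder()
                .setUserId("u-12345")
                .setBrowser("firefox")
                .setTimestamp(System.currentTimeMillis())
                .build();  // feature_id simply left unset by older clients

        // Records written by newer clients can still be parsed by older
        // readers, which ignore unknown fields; that is the versioning
        // property the speaker is describing.
        try (FileOutputStream out = new FileOutputStream("events.bin")) {
            event.writeTo(out);
        }
    }
}
```

The design point is that adding field 4 does not require migrating existing data, whereas the equivalent change in MySQL would be an ALTER TABLE on a very large table.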
For example, those little problems include corrupt data: if your data is corrupt, then no matter what analysis you do with whatever tools, you're still going to get results that you can't use. Big data solutions also don't prevent you from asking the wrong question to begin with, and then your entire 1,000-node Hadoop cluster is not going to help you, right? And big data is not a solution for not understanding the basics of statistics. There's a lot that you can do with sampling, and there's a lot of literature around how to sample well (there's a small worked example of this after this section).

Then obviously there are some cases where you can't use sampling and you have to go fine-grained. For example, in PayPal's case, they can't just sample 2,000 or 3,000 users, find with statistical confidence that most of those users are fraudulent and from one country, and conclude that they have to ban that country from their system. There they have to go down to a granular level, looking at each person's transaction log and deciding: is this person high risk or not? Though they may use aggregate statistics to inform that decision. So that's obviously one case. The other case is that if you want to understand product usage at a per-user level, then obviously you can't sample. But for a lot of questions, like whether this feature really made an impact on our numbers, a lot of that can be done with just sampling, and you don't really need big data to answer them.

One of the other takeaways for us from the Hadoop experience is that you run into problems you generally don't have with SQL. How many of you have written an SQL query at least once at some point in time? Right? Almost all of you. And how fast do you find out whether there's an error or not? Instantly. With a Hadoop job, the way it works is: I write the job, I kick off the job, then I go for a run or something like that. In some cases, especially our nightly job that computed all these things, it would take five hours to finish computing.

And like any other system, Hadoop has its fair share of downtime. There's a single name node, which is the one piece everything hangs off, and if that goes down, you're going to have some downtime. Maybe that's changed now; I don't think it has, but if it has, great. Back then there was a single secondary name node, except it's not really a secondary name node, it's just... I don't know how technical I should get here, but basically think of it as: you're going to have downtime. And if you have downtime, say it takes us three days to bring the cluster back up, then, since every single day's job takes five hours, how do you catch up on the days you've lost? You need to compute that particular day's job as well as the previous three days' jobs, which is another fifteen hours of computation on top of the normal daily load. And the company is not okay with just pausing the reporting process, because there are partners we have to report numbers to. These are problems that generally don't come up when you get results back very quickly. So those are some of our learnings.
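Going back to the sampling point above for a moment, here is a minimal sketch of the kind of answer sampling gives you, using a normal-approximation 95% confidence interval. The sample numbers are made up for illustration:

```java
public class SampleEstimate {
    public static void main(String[] args) {
        // Hypothetical numbers: out of a sample of 3,000 users,
        // 540 used the feature we are asking about.
        int sampleSize = 3000;
        int usedFeature = 540;

        double p = (double) usedFeature / sampleSize;          // point estimate
        double stdErr = Math.sqrt(p * (1 - p) / sampleSize);   // standard error
        double margin = 1.96 * stdErr;                         // 95% CI, normal approximation

        System.out.printf("Estimated usage: %.1f%% +/- %.1f%%%n",
                          100 * p, 100 * margin);
        // Prints roughly: Estimated usage: 18.0% +/- 1.4%
    }
}
```

For aggregate questions like this, the error bars shrink fast enough that scanning the full dataset buys you very little.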
The other thing is that you never know if a job is going to succeed until it actually finishes running. We would have errors where the job would take five hours and then fail in the fifth hour. If you know MapReduce, there's a map phase, there's a lot that goes on in between, and then there's a reduce phase; the job would stop at 78% of the reduce phase, which is like four hours and 55 minutes in, and say: out of memory. And then you're reset back to step one. And that happened a lot.

So these are some of our takeaways. Hadoop is not necessarily one-to-one with big data: you can use Hadoop, with some systems on top of it, for a lot of use cases that make your job easier even if you're not dealing with big data. And big data is not the solution for every type of problem, in my experience anyway.
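For reference, here is a minimal sketch of the map-phase/reduce-phase structure the speaker refers to, as a distinct-users-per-product count in the classic Hadoop Java API. The input format and field layout are assumptions, not from the talk:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyActiveUsers {

    // Map phase: assume tab-separated log lines of "userId<TAB>product<TAB>...";
    // emit (product, userId) for each event.
    public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text product = new Text();
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                userId.set(fields[0]);
                product.set(fields[1]);
                context.write(product, userId);
            }
        }
    }

    // Reduce phase: count distinct users per product. Collecting every user ID
    // for a product in one in-memory set is exactly the kind of reducer that
    // runs for hours and then dies with OutOfMemoryError near the end,
    // as in the story above.
    public static class DistinctUserReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> users = new HashSet<>();
            for (Text v : values) {
                users.add(v.toString());
            }
            context.write(key, new IntWritable(users.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "daily-active-users");
        job.setJarByClass(DailyActiveUsers.class);
        job.setMapperClass(EventMapper.class);
        job.setReducerClass(DistinctUserReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A production job would avoid the in-memory set (for example, by using a secondary sort), but the shape of the map and reduce phases is the part that matters here.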