Hey, everyone. My name is Joydeep, and I work at a company called Qubole. I'm going to talk a little bit about what we are building. Thanks to Fifth Elephant for letting us come and present some of the fun things we're doing. OK, I'm getting nervous. All right.

Let me give you a little background on who we are, what we are trying to build, and the context for this talk. I'll give you a somewhat personal narrative, because that's the easiest way for me to tell the story. I remember first playing around with the cloud in 2009. I was on paternity leave and had come back to India, and I logged into AWS. I was working on Hive at the time, and I said: let me make Hive work on the cloud. That was a very eye-opening experience for me. I come from the world of NetApp and Oracle, working on big, honking, enterprise-class systems, and here you had hardware on the fly, storage on the fly. It was just magical. At the back of my mind I had always wanted, once I finished my gig at Facebook, to come out and build things for the cloud. So that's one part of it: I just find the cloud really, really cool.

That's the next slide, and you all know this stuff. You can get hardware on demand. You can expand and shrink as you wish. It's cheap; not the cheapest, but there are magical things like spot instances, where you can bid for capacity, which is kind of amazing. And it's effectively infinite storage: however much data you want to store, you just keep doing your S3 puts, and it stores it. If you have ever worked in an operational role where you actually managed boxes and capacity, you know that's a nightmare. You're at a fast-growing company, and every day you wake up thinking: we're at 90% capacity, 95% capacity; did the ops guys order enough hardware? A whole lot of really uninteresting problems. On the cloud, somebody else takes care of those problems, and we're all thankful to Amazon for quite literally showing us the way here.

The other part of me is that I'm not just an engineer who builds things; I have also played the role of an analyst. I've worked with business teams and built data-driven applications. Sometimes they were interesting ones, like data mining or building recommendations, but often it was very basic stuff: building reports, financial reports, auditing, tallying things up. Very mundane. And one thing I've always felt deeply is that when I'm in the role of an analyst, the last thing I want to do is understand how something is implemented. I've worked with tons and tons of database folks, and they're always coming up with fancy stuff: here's the best way of doing joins, you just set these five options and everything magically becomes fast and beautiful. But when you're an analyst doing your day job, analyzing data, building workflows and data pipelines, the last thing you want to think about is how something is implemented or optimized.
And that is where the cloud gets really complicated. I actually borrowed these pictures from an Amazon presentation on SlideShare, so it's a bit unfair to them. The platform is beautiful, but we all know it's also very complicated. You have to learn a lot of things: key pairs, regions, buckets, and it goes on and on. As an engineer, it's all very nice and interesting. But think of yourself as an analyst sitting there doing your day job. What does your day job have to do with any of this?

So these are the two things we tried to bring together. We said: the cloud is really cool, and we want to make life for analysts really, really simple, so that they can exploit the power of the cloud. What I have here is just an enumeration of what you have to do if you want to run some Hadoop or Hive jobs. You have to set up an RDS instance with your own metastore. You have to start your own cluster. Then you start wondering: how many nodes do I want? What kind of nodes? Large, extra large, cluster compute, spot, on-demand? If spot, how much should I bid? What if I don't win the bid? What if the instance just disappears underneath me? Even if you understand all these concepts, just getting there is quite a bit of work. It's pretty complicated.

So what's the way out? What I'm going to do in this presentation is talk a little bit about what we have built. I don't think I'll be able to do justice to everything, so I'll focus deep on some of the specific technologies we've built. But the general understanding I want to convey is that these are part of an overall spectrum of things we are working on.

This is what you would see if you signed up for our application and logged in via the browser: a dashboard showing the Hadoop and Hive jobs you have run and their status. You can go look up the results and everything. What you would not see is how many nodes you need, or whether you first have to get the data into HDFS, or things like that. You focus on your business logic; that's your day job. We focus on the infrastructure; that's our day job.

Simple things. Dead simple. I almost feel ashamed putting up this UI, because it's nothing; it's just a Hadoop JobTracker page. But notice that this page is from a query I ran almost a month back, and I can just click and pull it up. I don't have any instances running on Amazon. This is a managed environment where the Qubole Data Service has stored the logs away in the right locations; you click and you just see them. Very basic stuff.

As I said, what I'll try to do in this talk is focus on a couple of things that to me are very interesting from a technology point of view, and leave the rest for discussion afterwards, or you can go to our website. Auto-scaling is one of the first things we built. We have this goal of making life really easy. So you write in a query.
By the way, the background here is that most of this is Hive and Hadoop. You come in, sit down, and write a Hive query. The first question is: how many nodes? What kind of nodes? How much memory? And so on. Auto-scaling is obviously a very simple, basic primitive. We all know it, and we're all happy users of it for the web tier. But it turns out it's not that easy for Hadoop. I'll talk a little bit about why, but first let's see how it's supposed to work.

You're a customer of ours, and you get a virtual cluster. You log in, and there's always a cluster available. You fire a query, and it turns out to be a query on a very small table, so we bring up a small cluster for you. You don't have to think about it. Now another analyst comes in, or maybe you yourself fire a second query, this time on a much larger table. What happens? We size the cluster up. Again, you didn't have to think about it, and we were able to multiplex two compute jobs onto the same set of nodes, which is always good. The queries finish; we get rid of the machines.

One of the smart things we do: you have already paid for the machines for the hour, so we keep them running for the hour. On the chance that you were just taking a coffee break and are going to come back, you still have your cluster and your resources available. It's all transparent; things are just a little faster for you. And after some time, if none of your users are active, everything shuts down.

What you see here is that our customers get to focus on the business logic, and we focus on provisioning and deprovisioning machines: how many, what type, optimizing on cost, making sure you get your money's worth, and so on.

Another very natural tendency in companies that use the cloud a lot: you start with one job, two jobs, you write some flows, and very soon you have quite a few of them. And why not? When you start using some infrastructure, you should anticipate a lot of growth in it. Over time, you find machine-generated jobs and human-generated jobs, all spawning their own clusters and their own machines. I was talking to a friend and customer in the Bay Area who had done some analysis on the instances they were using, and he said they were at maybe 40% utilization. It doesn't make sense, because everybody is coming in and starting up their own machines. If instead everybody went to the same set of machines, you'd get much better utilization: while John has spun up a big cluster and finished his job half an hour into the hour, those resources are still available for use by other people, and that's how you get efficiency. And so one of the things he told me was: I wish I could send everything to one place, but I don't know how to scale it up and down, and you guys have solved that. Such a simple thing; auto-scaling is a word we all know, but solving that problem turns out to be a very big win for most organizations.
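To make the billing-hour behavior from a moment ago concrete, here is a minimal sketch of the release decision, assuming per-hour billing like EC2 had at the time. The class, the idle signal, and the five-minute grace window are all illustrative assumptions, not Qubole's actual code.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Decide whether an idle node should be released now, or kept around
 * until the end of the billing hour it has already been paid for.
 */
public final class HourBoundaryPolicy {
    // Terminate only in the last few minutes of a paid hour, leaving
    // slack for graceful decommissioning. The window is a guess.
    private static final Duration GRACE = Duration.ofMinutes(5);
    private static final Duration HOUR = Duration.ofHours(1);

    public static boolean shouldRelease(Instant launchedAt, Instant now, boolean idle) {
        if (!idle) {
            return false; // never remove a node that is running tasks
        }
        long secondsUp = Duration.between(launchedAt, now).getSeconds();
        long secondsIntoHour = secondsUp % HOUR.getSeconds();
        long secondsLeftInHour = HOUR.getSeconds() - secondsIntoHour;
        // The hour is paid for anyway, so keep the node as warm capacity
        // until just before the next billing hour would start.
        return secondsLeftInHour <= GRACE.getSeconds();
    }
}
```

The design point is simply that an idle node is prepaid capacity until its hour runs out, so there is no reason to give it back early.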
Oh yeah, this slide is for my kid. She loves happy faces, so everybody's happy when we do this.

Just to add a little from the technology point of view: why is this not a simple problem? Normally, when you're auto-scaling a web tier, your load is reasonably smooth. You can say: if I've had high CPU usage for some amount of time, I will most likely have high CPU usage going forward as well, so I can decide to add nodes. But Hadoop is not like that. These batch systems are extremely bursty, and the fact that you were at 100% CPU for the last five minutes means nothing. You could be at absolutely 0% CPU the very next second; or, for that matter, you could be at 0% CPU and still be bottlenecked, because what you're doing is network I/O. So, since we have a lot of experience working within the stack, we've gone inside it and written code inside the JobTracker that looks at all the queues and asks: if the current set of jobs and queries were to run with the hardware resources we have right now, how much time would they take, and is that time acceptable? If it is not, we go ahead and fire up some more machines.

The funny part is that adding nodes is actually the easy part. Deleting nodes is even harder, because these systems are fault tolerant only up to a limit. Say you have a large job with one task still waiting: a thousand tasks have run, and one reducer is still chomping on some data. Ninety-nine percent of your job is done, so you take out most of the nodes, thinking you're basically finished, and the system falls flat on its face, because Hadoop thinks all the intermediate data has disappeared. Ridiculous as it sounds, that's how the system works: it reruns everything to rematerialize that intermediate data. So you have to be very careful about how you remove nodes. You have to decommission nodes from HDFS so the file system doesn't go into a corrupted state, and so on. There are some references to caching here; I'll talk more about caching in the subsequent slides.

Other interesting things: I briefly mentioned spot instances. If you have a cluster with a mix of spot and core instances, how do you place data? You know spot instances can disappear at any time; it's just a matter of waiting long enough to hit one unlucky day. So, simple rule: if it's data I actually want to hang on to, keep one copy on one of my core nodes, because those will hopefully never disappear.
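Here is a minimal sketch of that upscaling decision: project how long the queued work would take on the current hardware, and ask for nodes only if the projection breaks an acceptable bound. The wave-based model and every name here are assumptions for illustration; the real logic, as described above, lives inside the JobTracker.

```java
/** Upscaling driven by JobTracker queue state rather than CPU metrics. */
public final class UpscaleDecision {
    /**
     * @param pendingTasks   tasks queued across all running jobs
     * @param runningTasks   tasks currently executing
     * @param avgTaskSeconds observed average task runtime
     * @param slotsPerNode   task slots contributed by one node
     * @param currentNodes   nodes currently in the cluster
     * @param slaSeconds     how long we are willing to let the backlog take
     * @return number of extra nodes to request (0 if none)
     */
    public static int extraNodesNeeded(int pendingTasks, int runningTasks,
                                       double avgTaskSeconds, int slotsPerNode,
                                       int currentNodes, double slaSeconds) {
        int total = pendingTasks + runningTasks;
        int slots = currentNodes * slotsPerNode;
        // Projected time to drain the queue with the current hardware,
        // modeled as full "waves" of tasks across the available slots.
        double waves = Math.ceil(total / (double) slots);
        double projectedSeconds = waves * avgTaskSeconds;
        if (projectedSeconds <= slaSeconds) {
            return 0; // current cluster will finish soon enough
        }
        // How many slots would bring the projection under the bound?
        double wavesAllowed = Math.max(1, Math.floor(slaSeconds / avgTaskSeconds));
        int slotsNeeded = (int) Math.ceil(total / wavesAllowed);
        int nodesNeeded = (int) Math.ceil(slotsNeeded / (double) slotsPerNode);
        return Math.max(0, nodesNeeded - currentNodes);
    }
}
```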
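And a small sketch of that spot-placement rule. HDFS does expose pluggable block placement, but the types below are simplified stand-ins rather than the real interfaces; the rule itself is just the one described above, pinning the first replica to a stable node.

```java
import java.util.ArrayList;
import java.util.List;

/** Keep at least one replica of durable data off the spot nodes. */
final class SpotAwarePlacement {
    static final class Node {
        final String host;
        final boolean isSpot;
        Node(String host, boolean isSpot) { this.host = host; this.isSpot = isSpot; }
    }

    static List<Node> chooseTargets(List<Node> candidates, int replication) {
        List<Node> chosen = new ArrayList<>();
        // First replica: always a core node, since spot nodes can vanish
        // whenever the spot price spikes.
        for (Node n : candidates) {
            if (!n.isSpot) { chosen.add(n); break; }
        }
        // Remaining replicas: spot nodes are fine, they are cheap capacity.
        for (Node n : candidates) {
            if (chosen.size() >= replication) break;
            if (!chosen.contains(n)) chosen.add(n);
        }
        return chosen;
    }
}
```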
So there's a fair amount of interesting engineering required just to make this very simple primitive work.

The other thing I'm going to talk about is cloud storage, because that's the other big thing. There's an infinite amount of storage, and that's beautiful, but it comes at a cost. You can find plenty of research on the web comparing S3 to local drives, or to HDFS running on those local drives; what I'm presenting here are back-of-the-envelope numbers from our own testing. S3 is about 4x slower than local drives. It's very, very slow on small files. Seeks are very expensive, because every seek means opening a new HTTP connection.

There are some charts here. Vinayak was talking earlier about what not to do in visualization, and this chart is everything he said we should not do. But what you can see is that the run times on S3 are the blue lines, the run times on HDFS are the red lines, and S3 takes 4 to 5x longer. The other thing is that the amount of jitter is just unimaginable, and it makes sense: this is a shared environment where people are doing God knows what, and we're all sharing the Ethernet infrastructure inside the cloud, so you see a tremendous amount of variation in the latency from S3. The best case is pretty good, but a run could land anywhere. In this particular data set, the average run time was about 95, but individual runs ranged from roughly 75 to 125. A tremendous amount of variance.

If you go to the forums, you'll find people have asked about exactly the points I'm raising, and the advice that comes up is: copy your data from S3 to HDFS, then start working on it. Make things faster, make things more predictable. And that's a very good, very valid approach. But go back to that analyst sitting at a desk, just trying to write a SQL query. How on earth is that person supposed to know that for the query to run optimally, data has to be copied from one place to another? These are exactly the kinds of things we don't want people to think about and deal with. So what we've done instead is use HDFS as a cache. That's effectively what people are doing manually when the forums say to page the data in yourself; a computer should be doing it for the user instead.

Let me go a little into how we built the cache. It's pretty straightforward. Imagine you have a MapReduce cluster and your data is stored in S3. A query comes along, reads the data from S3, and we do some magic to start caching it.
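A minimal sketch of what that transparent read path could look like on top of the Hadoop FileSystem API. The /cache path layout and the background-population hook are assumptions for illustration; only the check-cache-else-read-S3 shape comes from the talk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Transparent read path: serve from the HDFS cache when present,
 * otherwise read S3 and populate the cache off the query path.
 */
public final class CachingReader {
    public static InputStream open(String s3Uri, Configuration conf) throws IOException {
        URI src = URI.create(s3Uri);
        // Hypothetical cache layout: /cache/<bucket>/<key> on local HDFS.
        Path cached = new Path("/cache/" + src.getHost() + src.getPath());
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(cached)) {
            return hdfs.open(cached);         // hit: local-disk speed
        }
        FileSystem s3 = FileSystem.get(src, conf);
        scheduleCachePopulation(src, cached); // async, off the query path
        return s3.open(new Path(s3Uri));      // miss: fall back to S3
    }

    private static void scheduleCachePopulation(URI src, Path dst) {
        // Hypothetical hook: a background task copies (and columnarizes)
        // the object so the next query hits the cache.
    }
}
```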
The one thing I would point out is that it's not just a simple file cache; it's actually a columnar cache. If you take big JSON data with, say, 15 columns, we look at what you queried, break the data up into columns, and cache it in a columnar format. That gets us tremendous speed improvements; I'll give some numbers shortly. There's a bunch of slides here that, in the interest of time, I'll just skip over, but you can see what's happening: the next time the job comes along, it simply reads from the cache instead of the original S3 file. And in the event that you recycled your cluster, so you went away, the cluster shut down, and you came back and got cluster two instead of cluster one, we still have the columnar representation of the data in S3, and that's still a lot faster than reading the original flat file. Your queries will automatically read that data and, of course, repopulate the cache. We take care of all of the expiry, making sure the cache stays under a certain number of gigabytes, and things like that. So, in a nutshell, things are a lot faster and a lot more predictable. We've seen speedups of at least 3x to 5x, and I think we're still just getting started; there are a lot of optimizations we haven't made yet, so there's a lot of headroom to make things even better.

The other interesting question: if we're using HDFS as a cache, should it behave exactly like normal HDFS, or a little differently? Because if a node dies, you don't need to re-replicate the data; it's just cache data. So we've made those changes. When nodes go down or are removed from the cluster, we find out whether the data they were holding was just cache data, and if so, we just drop it on the floor rather than burdening the cluster with all that re-replication.

Just wrapping this up: those were a couple of deep dives, but there's a lot of other things we have done. To go back to my point about not making the analyst understand how things are implemented: a simple checkbox. I want a quick-and-dirty run on this one terabyte of data; I don't care how you do it, just give me up to 95% accuracy, or whatever you want to do. So we put a checkbox on the site, and there's a bunch of tricks behind it on the backend. We sample data, we stop big jobs at the 90th percentile, and we have approximate operators for the things that are hard to do without the complete data, like count distinct. The analyst gets results back very, very quickly, even though they are not fully accurate. Other features: we extract samples into MySQL and let people test against that. Again, in a very transparent manner; you don't have to think about whether you're going against Hive or MySQL. It's just a subset of the data, and you can validate your queries and expressions quickly against that smaller data set.
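Going back to the columnar cache for a second, here is a toy sketch of the columnarizing step: only the columns a query referenced get extracted and stored as per-column value lists. Records are modeled as maps for brevity; the real input would be parsed JSON, and the actual on-disk format is not something the talk specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Break wide records into per-column value lists for the cache. */
public final class Columnarizer {
    public static Map<String, List<String>> toColumns(
            List<Map<String, String>> records, List<String> referencedColumns) {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        for (String col : referencedColumns) {
            columns.put(col, new ArrayList<>());
        }
        for (Map<String, String> record : records) {
            for (String col : referencedColumns) {
                // A record missing the field contributes a null, as usual
                // in a schema-on-read world.
                columns.get(col).add(record.getOrDefault(col, null));
            }
        }
        return columns;
    }
}
```

Later queries that touch only two of the fifteen columns then scan just those two value streams, which is where the speedup comes from.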
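The talk doesn't say which algorithm backs the approximate operators like count distinct, so here is one standard technique, a k-minimum-values sketch, just to show the flavor of trading accuracy for a single bounded-memory pass.

```java
import java.util.TreeSet;

/**
 * K-minimum-values sketch: approximate COUNT(DISTINCT) in one pass
 * by tracking only the k smallest hash values seen.
 */
public final class KmvDistinctCounter {
    private final int k;
    private final TreeSet<Double> smallest = new TreeSet<>();

    public KmvDistinctCounter(int k) { this.k = k; }

    public void add(Object value) {
        // Map the value to a pseudo-uniform point in (0, 1). A real
        // implementation would use a proper hash, not hashCode().
        double h = (value.hashCode() & 0x7fffffffL) / (double) 0x80000000L;
        smallest.add(h);
        if (smallest.size() > k) {
            smallest.pollLast(); // keep only the k smallest hashes
        }
    }

    public double estimate() {
        if (smallest.size() < k) {
            return smallest.size(); // saw fewer than k distinct values
        }
        // If n uniform points fall in (0,1), the k-th smallest is ~ k/n.
        return (k - 1) / smallest.last();
    }
}
```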
So let me just stop here. I wanted to give a quick gist of what we have built, and take some questions.

Question from the audience: How do you maintain data locality when you add new nodes to the cluster?

Right. So, you're right, there's no magic bullet here. When we resize the cluster and bring in data, we maintain locality for the lifetime of those nodes: while those nodes are running tasks and the data is there, they get locality. When those nodes are removed, as I mentioned, we just drop the cached data. Maybe the key thing here is that things are always better than they would be if you went straight against S3. That's the key thing.

Next question: Are you using any compression techniques when storing data in the cloud? And suppose a user uploads a very large number of small files, which is not really optimal for MapReduce; do you combine such small files into bigger ones?

Yes, we use compression when uploading data into our cache. And right when we got started, we made a bunch of optimizations to deal with small files. First, we made Hive's combine input format work with S3, so small splits get combined. Second, we made changes in the way our I/O stack deals with S3, so that small-file I/O is optimized. As for the second part of the question, we actually haven't finished that work. If you cache a bunch of small files, do they show up as a bunch of small files in your cache, or are they packed into something bigger? Longer term, yes, they need to be packed into something bigger, but right now they are not.

Next question: How do you handle dependency management? There can be a lot of dependencies between jobs.

For the dependency-management system we built a V1 on Oozie, but most of it has since been re-implemented. Our basic pattern is data dependency, not job dependency. So when you schedule a query, you say: the data I depend on is, for example, the partitions from yesterday. We've built that into our Oozie stack, where our Oozie coordinator takes care of waiting on those partitions and only then spawning your query.

Thank you, Joydeep. Thank you so much.