Hello. OK. So the goal of my talk, really, is to persuade you all that Hadoop isn't a scary thing and that everyone should use it. I don't know how successful I'm going to be, but I'm going to try. Just a quick question: who here has ever used Hadoop before? OK, so like three hands. So I have a lot of people to persuade.

I'm going to kind of go through the stack. I'm basically going to lay out what you can use Hadoop for, why it's useful, and why it's the best tool for the certain things you can use it for. So I'll go through an overview of Hadoop, what it is and what it does. Then I'll talk about the things you can do with Hadoop, so why to use it, basically. Then I'll talk about why it's a great model for those things. And I'll wrap up with some cool additions and take any questions people might have.

OK. So this is my summary. If you guys want to go hit the cowbell, you can just look at this slide and then leave. Basically, with Hadoop you can generate awesome production data sets, you can use it to optimize different algorithms you may use, like search algorithms, and you can use it for analytics. And it's good at all those things because it's a batch processing system: you can access tremendous amounts of data without taking down your database, it has all the retry logic built in, and it assumes the machines you're using are failure-prone, so it can retry and do all sorts of clever stuff to make sure your jobs don't fail. And really, the big point is that you can work over massive data sets and generate a lot of data. It's really useful.

I want to give you a bit of context on Foursquare, because I'm going to use examples from our stack to illustrate my points. If you guys don't know Foursquare, we're the app where you check in to a location, and then we can give you specials and badges, and you can be the mayor of wherever you are. The other feature we have is called Explore, where we offer socially driven recommendations. It's like a search engine, and we optimize it like a search engine; that's where the optimization point comes in. But there are a lot of things in our platform that require a Hadoop data processing pipeline. Oh, yeah, and in case you didn't see those numbers: we have a lot of data. We have 20 million users, over 2.5 billion check-ins, and we generate like 300 million event logs a day, which is like 500 gigabytes.

So, Hadoop overview. What is Hadoop? Hadoop is two things: a MapReduce framework and a distributed file system. And these things scared me to death when I started working at Foursquare, because they were like, hey, good job, you're the Hadoop guy now. I thought it was going to be really scary, but actually it's pretty simple, and I can hopefully do a good job of showing that to you.

It is not an analytics platform. Most people like to throw around, we already heard about data scientists, right? So they like to throw around data scientists, big data, analytics, big data analytics, all this big data statistics stuff. Hadoop isn't really designed for that. You can use it for that, but it's more powerful used the way I use it, and the way you can use it: as a production component of your infrastructure.

And versioning is a pain in the ass. There's a version one of Hadoop. Don't use it. It's a big pile of shit.
Basically, use version 0.20, and then maybe 0.21, or maybe 2.0.1; it depends. You're basically better off using somebody else's distribution, where they've fixed all the bugs inherent in all the versioning.

OK, so scary slides first. This is the infrastructure of Hadoop in terms of how you lay out your machines. Blue is MapReduce; orange is the distributed file system. There's a master node; you talk to that node, and it delegates tasks to its children. The job tracker, you give it MapReduce jobs, and it sends the task trackers maps and reduces to perform. The name node, you tell it to store a file, and it sends blocks of that file to the data nodes, which are orange. And they are redundant: you can say how many of these data nodes you want holding a particular piece of data. The default, I think, is two or three. It really depends on how confident you are that your machines aren't going to die. It's also important, before I move on to the clever stuff: if you co-locate task trackers and data nodes, you get benefits, because the data is in the same place as the computation.

So, clever stuff it does. It assumes your machines are bad. If a map task fails, for example, it'll get retried on three other task trackers. Your data nodes are redundant, so a piece of data is in two places, or maybe three, or four, or five, however many you want to specify. You can have many users, with fair scheduling and quotas, so you can say a user only gets to store 100 gigabytes, or they don't get all the resources, et cetera. And you can use S3 as if it were HDFS instead, which is really nice.

So, map and reduce. The bulk of everything I do is writing map and reduce functions. This is kind of Ruby pseudo code for map and reduce; these are the Java APIs that I'm fake-programming against with Ruby. In a map, you get a key and a value, and you output a key and a value, and whatever you do in the middle is up to you. So in this example, I take a line number and a check-in, and I output a user ID and a check-in. The reduce function gets a key and a list of all the values you've output with that key. So in this example, I get a user ID and a list of check-ins, and all I do is iterate over the check-ins and count them, and output that at the end. So I get a list of user IDs with how many times each user has checked in. It's pretty simple, but when you have billions of records, this is pretty good. It's like one of the only ways to do it without taking down the database.

This is the other scary diagram, and this is the MapReduce pipeline. In blue is what we just wrote, map and reduce, and all the stuff in the middle basically organizes the values and the keys so that the same keys end up together and all get handed to the reducer in the same go. You can override that stuff, but generally speaking you won't really need to touch it unless you're doing clever data joins or something really intelligent. Generally, you write map and reduce and everything's good; the stuff in the middle just organizes it so that the reduce interface works.
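To pin down the map and reduce functions from a couple of paragraphs back, here's a minimal Ruby sketch of the check-in counting example as described; the emit helper is a stand-in for whatever output call your framework gives you, and the field names are illustrative, not the actual Foursquare code:

    # Stand-in for the framework's output call.
    def emit(key, value)
      puts "#{key}\t#{value}"
    end

    # Map: gets a line number and a check-in record; re-keys the
    # check-in by the ID of the user who made it.
    def map(_line_number, checkin)
      emit(checkin[:user_id], checkin)
    end

    # Reduce: gets a user ID plus every value emitted with that key;
    # outputs the user's total check-in count.
    def reduce(user_id, checkins)
      count = 0
      checkins.each { count += 1 }
      emit(user_id, count)
    end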
So before I jump into some examples of why it's good, this is our Hadoop pipeline at Foursquare. We have two types of data: log data and database data. Every night we take a fresh snapshot of all our databases and put it in S3, and every day we collect all the logs we generated through that day and put them in S3. That gives us a really consistent view of what Foursquare looks like every day for us to run jobs over. Event logs are probably 500 gigs a day uncompressed, and the Mongo snapshots are less than that, but that's all of Mongo, so it's probably a few hundred gigs.

OK, so obviously everyone here is a Rubyist, right? We use Scala at Foursquare, but for a long time we used Ruby, because I used to use Ruby, and I didn't see any reason to stop. And to do that with Hadoop, you don't have to use JRuby or anything crazy. Hadoop has this streaming jar, so you can just basically give it two scripts, a mapper and a reducer, and it pipes the data to them. You read from standard in, you write to standard out, and that's pretty much it. You put a tab in the middle of each line: the part before the tab is the key, the part after the tab is the value. And if you want to implement partitioners and sorters, you can do that with Ruby as well, or with just regular shell tools.

So here's a quick example. This is a mapper at the top: I read from standard in, I read each line, I split it into key and value, then I do some computation, and I output the new key and value. And I run that with a pretty simple command line: I give it the streaming jar, and I give it the inputs, the outputs, and the mapper and the reducer.
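The slide has the real scripts, but here's a hedged sketch of what a streaming mapper and reducer look like in Ruby, reusing the check-in counting example (input format and paths are invented for illustration):

    #!/usr/bin/env ruby
    # mapper.rb: each input line is a JSON check-in record.
    # Emit the user ID as the key, a tab, then a count of 1.
    require 'json'
    STDIN.each_line do |line|
      checkin = JSON.parse(line)
      puts "#{checkin['user_id']}\t1"
    end

And the reducer:

    #!/usr/bin/env ruby
    # reducer.rb: streaming hands us lines sorted by key, so all of a
    # user's counts arrive together; sum each run and emit the total.
    current_id, count = nil, 0
    STDIN.each_line do |line|
      user_id, n = line.chomp.split("\t", 2)
      if user_id != current_id
        puts "#{current_id}\t#{count}" if current_id
        current_id, count = user_id, 0
      end
      count += n.to_i
    end
    puts "#{current_id}\t#{count}" if current_id

And the launch command is roughly:

    hadoop jar hadoop-streaming.jar \
      -input /checkins \
      -output /checkin-counts \
      -mapper mapper.rb -reducer reducer.rb \
      -file mapper.rb -file reducer.rb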
So obviously you want to test it, you want to do all those things, and there are a bunch of Ruby frameworks for that. The framework called Wukong is really awesome; we used to use it a lot at Foursquare. It basically gives you nice interfaces for your mappers and reducers, and makes it so you can test them and call them without using Hadoop, so you can use the same code elsewhere in your stack. [Audience question.] No, never tried it, so I can't really comment.

So that was kind of the boring bit with the diagrams. It can be really confusing, so hopefully nobody's confused. Now I want to talk about things we can do with Hadoop, and I'll largely rely on examples for this. I'm going to talk about the things Hadoop is generally best at, and then I'm going to give you some concrete examples from Foursquare, and I think you'll see why it makes sense. This is really the core of it: if it makes sense to you why we're using Hadoop for these things, then hopefully you'll go away and maybe play around with it after the conference.

OK, so these are the three things I'm going to talk about: generating data sets, optimization, and ad hoc querying.

Generating data sets generally follows this pattern: you want to do the same thing for a whole load of records, and the output is going to be a key and a value. That's the user check-in counting we just did, right? We do the same thing for every user, counting check-ins for every single person, and the output is a key and a value: the user ID and the count of check-ins.

So the examples I have. First: who has done what. This is a screenshot from Foursquare.com, where we recommend a place based on how many of your friends have been there. In this example, a bunch of my friends have been to Blue Bottle Coffee, and that's Hadoop-generated, right? Because we basically want to count every user-venue combination. That's 20 million users times an average of like a thousand places people have been. So that's a lot of records. Imagine doing that with a queue: it would totally destroy your stack or your database. You'd have to write throttling, all sorts of crazy stuff. If you do it in Hadoop, it's a very simple job. You're just doing a count, right?

So an example that doesn't involve counting, because most things in Hadoop involve counting, is generating newsletters. We want to pull together everything your friends have done over the last week and generate an email that tells you about it and gives you kind of interesting things. So you not only need all your friends, everything your friends have done, everything your friends have liked, you also need what they've been looking at. What did they really enjoy doing? Did people comment on their check-ins? Did people comment on their tips? You can pull that all together down the Hadoop pipeline and put it back in a key-value store for use.

Third example: suggested friends. I'm going to do a counter-example with this one. We basically take friends of friends and then see how many friends you have in common with them, right? You can see I have eight friends in common with Alessia at the top. I don't know who she is, actually, but it's OK. Maybe I should be friends with her. So it's a very simple job where we output a user ID and a list of user IDs, and just generate a list of people we think you might know based on the friends-of-friends algorithm. Pretty simple.

Now the counter-example: if we did this with a database and a queue, we'd have to pull the user record, then pull all their friends, so maybe a thousand friends. Maybe each of those has a thousand friends, and now you have a million records. Then you have to rank them, return them, and write them to the database. And if you have 20 million users, you're doing this 20 million times. So I think it's clear that that's not something you want to be doing at peak load while your application is running. If you do it offline and then load it back in, you have very little impact on the system.

I kind of alluded to this, but you're best off not putting your data back in a database; you're best off putting it in a key-value store, because all the data that comes out is key-value. HBase, Cassandra, Redis; even Redis works really well.

This is some more fun stuff: coffee or hot dogs. Who drinks coffee at 3 in the morning? Yeah, exactly. It's pretty intuitive that you don't drink coffee then, but we wanted to verify it so we could add it as a signal to our ranking algorithm, so that we don't recommend you go get a cup of coffee at 3 in the morning, because that would be silly. This is a visualization of that. Gorilla Coffee is a coffee shop, Gray's Papaya is a hot dog place, and the green line is a nice restaurant. Obviously people go to restaurants in the evenings, on Fridays and Saturdays; people get coffee in the mornings and all day Saturday; and hot dogs are pretty flat all the time. So this is a really good signal. You can put numbers on it and use it to improve ranking. We have a linear function that computes our search rank, so this can be a feature of it. And even though we don't necessarily load this data back into our system directly, it's really useful, because it's helped us generate this signal and quantify how much of an impact time of day and day of week have on results.
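As a sketch of how a signal like that falls out of a single counting job, in the same pseudo-Ruby style as before (the category field and time handling are assumptions, not our real schema):

    # Stand-in for the framework's output call, as before.
    def emit(key, value)
      puts "#{key}\t#{value}"
    end

    # Map: key each check-in by venue category plus hour of day,
    # assuming checkin[:time] is a Ruby Time.
    def map(_line_number, checkin)
      emit("#{checkin[:venue_category]}:#{checkin[:time].hour}", 1)
    end

    # Reduce: sum the ones. The output is a per-category histogram
    # over hours of the day, usable as a ranking feature.
    def reduce(category_and_hour, counts)
      emit(category_and_hour, counts.reduce(0) { |sum, c| sum + c })
    end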
So another example: apparently people don't get ice cream in January. This is a chart correlating check-ins at ice cream shops with the temperature, or the time of year. Again, intuitively, as it gets warmer, people eat more ice cream. But again, having this as a signal in the search algorithm allows it to be more powerful, and you need to validate it and make sure you're not seeing something crazy.

A third example, which is about optimization but not about search ranking, is: what shape is a venue? I think this is pretty interesting in its own right. People check in to all sorts of places: JFK, the Golden Gate Bridge, and the Blind Tiger, which, I don't know what the Blind Tiger is. JFK airport, obviously, is a big place, so it has a big shape; all the black dots are check-ins. You could be in a lot of different places and want to check in at JFK. How do we know you're actually at JFK if we don't know the shape of the venue? Equally, the Golden Gate Bridge is basically a big line. It's not a circle or a dot; most people presume places are dots, and it's definitely not a dot. The Blind Tiger is more of a dot, but it has a halo of check-ins that spread out, because GPS on phones isn't particularly accurate. Being able to capture the percentile ranges of where people are in relation to a venue allows you to surface the right venues or give them more contextual information. So, again, this is computed with Hadoop: you take a bunch of lat/longs and a venue ID as the input, and you output a polygon of what the place looks like, with a bunch of confidence intervals for how far out it extends and in what direction, that kind of thing. It's pretty cool.

And then, measurement. You can find problems in specific areas. Can you see that? Does that show up? It doesn't really show up on here. This is API requests from a specific region. New York City is notoriously bad for AT&T, so looking at New York City, you can see blips in the graph of API requests from New York City when AT&T goes down. And this kind of stuff isn't going to make you change anything in the app, but it's nice to know when people are experiencing problems, to debug it and ask: why are they having these issues? Is everybody having this issue? You don't necessarily get this from real-time server monitoring, but you can find it after the event by going and digging through this data.

Cool. OK. So the third use for Hadoop, at Foursquare at least, is ad hoc querying. Before I even talk about these: they are the hardest things to Google for in the world. If you Google for Pig, you're not going to find it; there are lots of things about real pigs on the Internet. But essentially, Pig and Hive are scripting languages that generate MapReduce jobs, instead of compiling to, like, real, you know, assembly language or anything. We use Hive a lot. Hive is SQL: you specify SQL tables, you tell it where the data is, and then you just write SQL queries on top of it, like it's a real database.

So I'm going to give this a couple of seconds and then we'll get to the next slide. Does anybody know what this query returns? [Audience guess.] So, yes, that's true.
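The query itself is on the slide, but as a rough, hypothetical HiveQL reconstruction (table and column names invented, curse list abridged):

    -- Rank cities by the fraction of tips containing curse words,
    -- so big cities don't win just by having more tips.
    SELECT city,
           SUM(IF(tip_text RLIKE 'damn|crap|bloody', 1, 0)) / COUNT(*) AS rudeness
    FROM tips
    GROUP BY city
    ORDER BY rudeness DESC
    LIMIT 20;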
So it returns the rudest cities in the world, based on the number of tips left in those cities that have curse words in them. So, yeah, Manchester. We had the mayors of some of these towns email us and ask how they could improve their rating, which is pretty fun. Oh, yeah, and being at the top is bad; so to improve, they might want to stop being rude to their tourists, I guess. I think one of the towns towards the bottom had a, like, "we are nice" day to counter this.

But anyway, that kind of thing is really powerful. We didn't write that query; a business analyst at our company wrote it. It really takes a burden off the engineers to report on all this data when anyone can go and just find anything they want. As long as they know SQL, they can run any query they like and really get any result they want. Obviously, learning SQL is the hard bit, but once they have, it's great. It doesn't really work well with the analytics tools you normally put on top of databases, though.

OK, so those are interesting things you can do with it. Why is it good for those things? You don't need to sample. You don't need to throttle. You don't need to deal with failure recovery, what happens if your queue job fails; you don't need to do any of those things. You can just access all your data, as fast as you want, as fast as it can go.

And in combination with other data sets. You know how people are using your app, but have you ever correlated it with usage on Twitter, or some search terms on Twitter? What about Wikipedia articles? Do you have a wiki on your site? Do you want to join it up with Wikipedia in some way? You can do that with Hadoop. You can join against pretty much any kind of data set you want, so long as you have some way of joining the IDs together. And even that can be kind of fluffy; you can kind of hack it to join on something less than a definite ID.

So reusing your code is really, really the important bit, right? You want to reuse all your models. And that goes for event logs as well. Event logs are awesome. You just write data out, and it doesn't matter how much you send, hundreds of millions of events a day. It's super useful. Who uses the web? Who uses mobile? How long do people spend on it? Who goes to search results and then clicks through? Who buys something? All these things: how do you optimize that? How do you make it better? You don't need to really plan for any of these questions as long as you record enough information. Throw whole objects in your event logs. We throw in whole venue records; we throw in whole lists of venue records. And we log every result we serve from our search engine, so we can analyze it later and see where people clicked, what the venue was, and how good it was. The more you log, the more you have access to.

So, code reuse. Like I said, this is really important; I kind of thought this slide was before the other one. You really don't have to do much to use your Rails app with Hadoop, right? You just need to have some way to serialize and deserialize your records. And you have to make sure that those records don't expect a database to be living behind them, so that, you know, you don't accidentally, for every record in your entire database, try to pull all check-ins for all users at the same time. Because that would be bad.
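A minimal sketch of that idea, with an invented model: a plain class that only knows how to hold and (de)serialize its own fields, so the same code can run in the Rails app and in a streaming job with no database behind it:

    require 'json'

    # Shared between the app and Hadoop jobs. Note there are no
    # database calls here: everything it needs rides along as JSON.
    class Checkin
      attr_reader :user_id, :venue_id, :time

      def initialize(user_id, venue_id, time)
        @user_id, @venue_id, @time = user_id, venue_id, time
      end

      # Serialize for event logs or intermediate MapReduce output.
      def to_json(*args)
        { user_id: user_id, venue_id: venue_id, time: time }.to_json(*args)
      end

      # Deserialize a line read from a log or from standard in.
      def self.from_json(line)
        h = JSON.parse(line)
        new(h['user_id'], h['venue_id'], h['time'])
      end
    end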
Yeah, and if you use your models for event logging too, then you really do have the same code as in your app. So if you want to go into your app and generate a newsletter for somebody because they didn't get it, you can click a button and it'll do it; and equally, you can push the same code through a MapReduce job and it'll do it for everyone. And that's really powerful. That's kind of the core of what I'm trying to get at: you can do the same things you do in your app, but for every single person or every single record. It's really, really cool.

All right, so a few additions before the end. If you've ever played with Hadoop, it's a total nightmare to set up and get configured, basically. Amazon has their own distribution, Elastic MapReduce, running on EC2. You make one API call and they'll spin up however many machines you want, properly configured. You can read from and write to S3, and all you have to do is submit your Ruby job or your Hive query. And there are flags to set up Hive and install RubyGems and all this cool stuff. So you can get a Hadoop cluster and a Hadoop job up and running in, like, 20 seconds flat with this thing. It's really powerful.

If you want to take it seriously after that, don't go to the Apache code repo, download Hadoop version one, and try to run it, because nothing will work together and it'll just be a disaster. Instead: Cloudera is a good company, they have their own distribution, and all the components work really well together.

Some other cool frameworks, if you don't want to use Ruby. We've already talked about Hive and Pig, the scripting languages. If you want Java and Scala frameworks, Scoobi is the one we use the most. So if you want to play around with JVM languages, those are the ones to use. And you can always use JRuby with any of these frameworks. When we used Ruby, we just used plain old Ruby with Hadoop streaming, and that was more than good enough for us.

Then, when you really get serious, you'll probably want to schedule things to happen every day, so you want one of the workflow engines; we use Oozie at Foursquare. It's really good, except you have to write a lot of configuration, which isn't so good. And then some event logging frameworks, in case you don't do this yet and want to, because it's really an integral part of having a good Hadoop cluster. All these things do event logging. We started using Kafka. We did use Flume. Flume is bad; Kafka is good. That's the most I have to say on that.

So, yes, that's it. Hopefully I've given you an idea of what Hadoop is, what it's good at, what it's not good at, and why you should use it in your app to do a bunch of things you really can't do any other way, unless you want to have complex throttles to stop your database from falling over. So that's really it.
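For reference, the one-API-call setup mentioned a moment ago: with Amazon's old elastic-mapreduce command-line client, launching a streaming job looked roughly like this (bucket names are invented, and the flags are from memory, so treat it as a sketch rather than exact syntax):

    elastic-mapreduce --create --stream \
      --num-instances 10 \
      --input s3://mybucket/checkins \
      --output s3://mybucket/checkin-counts \
      --mapper s3://mybucket/scripts/mapper.rb \
      --reducer s3://mybucket/scripts/reducer.rb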
Does anybody have any questions? Yes. So the question was: for generating the newsletter, how many maps and reduces go into doing something like that, especially because you're using so much data, I guess, right? So the answer is probably two MapReduce jobs. Again, if you use one of the frameworks, they basically build DAGs for you, so you just specify one thing and it'll chain the MapReduce jobs together. So in that example, you have one MapReduce job that can pull in lots of different data sources: users, check-ins, likes, tips, whatever. You key them all by the same ID, the user ID, so they all go to the same reduce function, and then you squish them all together into one model and write that out. So what your second MapReduce job gets is the user ID and this giant model of a user with everything he's ever done: check-ins, tips, all that kind of stuff. Does that make sense? And then you can take that and maybe key it by each of their friends, so for each friend you can collect all of that together and create some big aggregate of it. Even if it's just something simple, like the places their friends checked into the most; I think the example I gave actually just looked at the places people had most checked into and surfaced those. There's a lot of other things you can do there, and you can plug that into any kind of machine learning tool to get it done.

Do we have time for more questions? Any others? OK. Thank you very much. All right, thank you.