Thank you for the warm welcome, and welcome to my talk on Apache Hadoop. It will be a talk on large-scale data processing, which is a very hot topic currently. But first of all, I would like to explain who I am. My name is Isabel, just as announced. I am the organizer of the Berlin Hadoop Get Together, and I am a co-founder of Apache Mahout.

What is Apache Mahout? Yet another name. Mahout is a library that intends to implement machine learning algorithms that scale. Scale in terms of community: we have a vibrant community and very lively mailing lists. Scale in terms of license: it has a commercially friendly license. And of course scale in terms of scale: it scales to large data sets, to huge amounts of data to train on. In order to reach that last goal, most of our algorithms are currently based on Apache Hadoop and implemented on top of that framework. At daytime I'm a software developer in Berlin.

Now I would like to know a little more about you. Whoever has seen a talk by me, and knows that I'm doing the Hadoop get-togethers in Berlin, knows that usually I take this microphone, give it to the audience, and ask you stupid questions. But there are so many people here that I'll do it a little differently this time. Please raise your hand if you know the term Hadoop. That's awesome. How many of you are actually Hadoop users? Okay, good. Next question: how many nodes does your cluster have, ten or more? A hundred or more? A thousand or more? Okay, good.

Some more buzzwords: who knows about ZooKeeper? Quite some people. Anyone aware of Hive? Some more. How about HBase? You should know about that one; there was an interesting talk in the NoSQL devroom this morning. Anyone aware of Pig? I mean not the little animal, but the project. Lucene? I want to see all your hands. Okay, Solr? Any Solr users? Great. Anyone who knew about Mahout before I told you about it? Anyone using it? No one. Yes, there's someone; I want to talk to you after my talk.

Okay, what am I going to talk about? First of all we have a chapter on collecting and storing data. The next chapter will be on analyzing data: I will give you a short tour of Hadoop, tell you what's coming up, and tell you a little bit about the history. Last but not least there are a few slides on the Hadoop ecosystem, because whoever raised their hand during the questions knows that there is not just Hadoop, the core project; there are many satellite projects that make working with the framework easier.

Okay, collecting and storing data. If we go the traditional way and have a look at where data is collected, we may come up with an example like this. We have a shop, we have products in the shop, and we want to collect information: how many products do we have, which price does each product have, how many products did we sell, information on our customers, where do they live, and so on and so forth. The first solution that comes to mind is a regular relational database like MySQL, Postgres or Oracle. So you store the data in a relational model, analyze the data, maybe put it into a data warehouse, maybe run OLAP queries over it. But what if the data that I have is not really relational data?
What if it's maybe a bunch of log files? Say, transaction logs from your regular webshop, or something like query logs if you're running a very successful search engine: you may want to track which queries users are actually searching for, to improve your system. And what if those log files scale to the point where they don't fit on your regular hard disk anymore? You end up with data that cannot be stored on a single machine, and you end up with data that cannot efficiently be processed in a serial way. The logical consequence would be to use multiple machines to process your data: build a little cluster, distribute the computations, use a distributed file system, and just go with that.

There are a few challenges when doing it this way. First challenge: you are all computer users, so you know that single machines tend to fail. The MacBook I'm using here to give this presentation had a hard disk failure twelve hours before I gave my talk at the Hadoop User Group UK last year, so I was pretty happy to have a backup of my presentation. If you have not just a single machine but a data center with multiple machines, each machine you add increases the probability of any of those machines failing. So you want a framework that gives you built-in backup, built-in replication and built-in failover.

Next, you need someone to write the programs that analyze the data. If you have a look at typical software developers: usually, when they come out of university, and they are not as brilliant as FOSDEM visitors, they have never dealt with large amounts of data. They don't know how to handle petabytes of data, and they don't know all the intricacies that come into play when writing parallel programs. And usually, per project, you don't have the time to make the software production-ready. By production-ready I mean something like: it has defined failure modes, it fails over if a machine crashes, it has defined error codes if something goes wrong, and so on and so forth. So you want something that is easy to use, basically something like parallel programming on rails.

And if you're thinking about using an open source framework, you want something where bugs are regularly fixed, where new features are added, and where your patches are integrated into the system. So you want something with a vibrant, lively development community. Last but not least, the guy between you, the developer, and your customer is an operations guy, and I think I can promise you that he's going to yell at you if the system isn't easy to administrate. He'll probably also start yelling if, for every single little application that you write, you're using a different framework. So you need something that is easy to administrate, and you need a single system that maps to quite a lot of the tasks you want to solve.

That is when you may want to have a look at Hadoop. It makes distributed programming easy. For me as a developer, it makes it easy to write distributed applications without a very, very deep background in parallel programming, cluster programming or HPC. It's well known in industry and research. It's used by companies like Yahoo, it's used by Facebook, by the New York Times, by Last.fm and many more. And it scales, well beyond 1,000 nodes.

So where does this funny little project come from? What is the history behind it?
Well, you may have heard of the MapReduce implementation at Google. In 2003 a paper was published by Google on their distributed file system, GFS, and about a year later the MapReduce paper came out. It didn't take very long until Doug Cutting, the original author of Lucene, reported that Nutch, an internet-scale search engine, made use of MapReduce. It didn't take long again for that module to grow so much that it was turned into a separate project beside Nutch. In 2007 Yahoo reported running a first Hadoop cluster with 1,000 nodes, and about two years ago Hadoop finally became its own top-level project at Apache. Last summer, just to show you that the framework really works, Yahoo won the petabyte sorting benchmark with a Hadoop cluster.

So what are the assumptions underneath the framework that you should be aware of if you're writing Hadoop applications?

The first assumption, as I already mentioned, is that the data does not fit on a single node. What follows from that is that we want to use commodity hardware. That doesn't mean the PCs under the desk of your secretary; it's still fairly beefy, strong hardware, but it's not specialized, dedicated hardware. What follows from that in turn is that failure happens. The idea is to distribute the file system and to build replication into it. Built-in replication means that every file stored in the file system is replicated, by default, two more times, so it's available three times, and you have automatic failover in case of failure.

The second assumption is that you have so much data that it's pretty expensive to move it from where it's stored to where it should be processed. So the idea is to turn the whole model around: move the computation to where the data is, and keep computation local to the data.

The third assumption is that a disk seek is very expensive compared to continuously scanning a file. So the APIs that you have available in Hadoop focus on making it very easy to scan over data, but they don't make it easy to write applications that need random access to your data. You need to reformulate your algorithm such that you can stream over the data.

If you go to the website, hadoop.apache.org, and download the package, what you end up with is basically two components: one is HDFS, the distributed file system, and the second is the MapReduce engine. We'll have a look at each of these in the coming slides.

First of all, the distributed file system. If you install that on your little cluster, you end up with one node, called the name node, holding the file metadata. Each file is split into separate blocks, and the name node keeps the information about which node each of these blocks is stored on. Besides that, you have several worker nodes, called data nodes, which actually store the data. Basically you could compare the name node to something holding the inode table of your cluster.

What this means is: our name node stores file metadata, it stores that metadata in memory, and it stores the mapping from file blocks to actual nodes. Keeping all that in memory means that the size of your cluster depends on how much main memory you give to your name node, and it depends on how large you make each file block. If you make the blocks large, you can store a lot of data, because you don't have so many blocks per file.
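To make that concrete, here is a back-of-envelope example. The numbers are not from the talk, just commonly quoted ballpark figures: roughly 150 bytes of name node heap per block object, and the classic 64 MB default block size. Storing one petabyte then works out to:

\[
\frac{1\,\mathrm{PiB}}{64\,\mathrm{MiB/block}} = 2^{24} \approx 1.7\times 10^{7}\ \mathrm{blocks},
\qquad
1.7\times 10^{7}\ \mathrm{blocks} \times 150\,\mathrm{B} \approx 2.5\,\mathrm{GB\ of\ name\ node\ RAM}.
\]

Double the block size and you roughly halve that footprint, which is exactly the trade-off just described.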
So if you're writing a program against HDFS, what does writing a file look like? Let's assume you have an HDFS client that runs on a client node. The first thing it does is go to the name node and tell it: I want to create a file. The name node tells it: okay, this file should be stored on this data node. After that, the client goes to the data node and stores its data. As I mentioned earlier, the system has replication built in, so the data node pipelines the data on to the next data nodes in the chain. After replication is complete and everything is written to the system, your method call returns.

The replication strategy that is used is basically a trade-off between the bandwidth you have between nodes and distributing your file evenly across the cluster in order to minimize the spreading of failures. You don't want all three replicas on one hard disk, obviously, but you probably don't want them all in one rack either. If you look at an example: we may have our client on the left-hand side. To optimize bandwidth, this client writes to its own data node; that node replicates to a different rack, and on that rack the block is again replicated to a different data node.

A file read looks similar. You have your HDFS client; it talks to the name node and tells it: I want file X. The name node tells my client: okay, this file is distributed across these data nodes. The client goes to the data nodes, streams the blocks, and gets the information.

So now you know how to store the data, you know how to read it back, you know how to interact with the files on a coding level. But what you really want is to write programs that analyze your data, so that you can extract information from it. That's where the MapReduce engine comes into play.

To explain what MapReduce is all about: how many of you have written MapReduce programs? I see quite a fair amount. Okay.

One example. Take this little XML file. It doesn't look very pretty; it's just a snippet of the RSS URLs in my feed reader. The goal of this task would be to read the file, extract the host name of each blog, and extract the top ten host names of the blogs that I read. If I were to do this on a standard Linux machine with regular tools, I would do something like that and come up with a list: okay, there are like ten RSS feeds from arXiv, six RSS feeds from Google, and so on and so forth. If you have a closer look at how that's done, it would probably look something like this. You define a pattern that kind of looks like a host name; no guarantees this is the right regular expression for a host name, it's just for the example. You grep over the file, you then sort by host name, and finally you count how often each unique host name occurs in the list.

If you map that over to MapReduce, what do you end up with? A map step for the grepping over the files, a reduce step for the counting, and what the framework does for you is the shuffle phase. Basically, your map function would look something like: read a block of data. Remember that in this case this is not a feed URL file that is just a few megabytes in size, but maybe a terabyte, a petabyte, so it may be distributed across the cluster. So our map function is run exactly on those nodes holding the respective fractions of our data. The map function then extracts key-value pairs, where the keys are the host names, and the values may be, for instance, how often I saw this host name in the current block of data.
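Before going deeper into MapReduce, here is roughly what those HDFS client interactions look like in code. This is a minimal sketch against the 0.20-era FileSystem API; the path and file contents are made up for the example, and all the name node and data node traffic described above happens behind these few calls:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml etc.; fs.default.name decides whether we
    // talk to HDFS or to the local file system.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Writing: create() asks the name node for a new file, the stream
    // pipelines the bytes to the data nodes, and close() returns once
    // replication is done.
    Path file = new Path("/tmp/hello.txt");
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, HDFS");
    out.close();

    // Reading: the name node tells the client which data nodes hold the
    // blocks, and the client streams them from there directly.
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}
```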
When I write the reduce function, I have the guarantee that I see all key-value pairs for one key in one call of the reduce function. So summing up is very easy: I just iterate over all the values and emit the key and the sum.

If you have a closer look at what this may look like: I read the data from HDFS; I have multiple map tasks running all over the cluster; each of these map tasks outputs key-value pairs, in this case host names and counts. This intermediate output is shuffled and grouped by key, and in the end I have reduce tasks that compute the final results.

If you have a look at the Java API, it may look something like this. If you are used to the old API of Hadoop 0.18, this is the new one; it's a lot cleaner and more compact. In the map function, again, you get a key-value mapping, in this case something like file name and content. You iterate over the content and extract host names, and the context object gives you a way of emitting key-value pairs.

The context object also gives you a way of emitting counter values. Counters can be used, for instance, to report to the framework the number of bad records that you have encountered, so that it just skips each bad record and keeps a statistic of how many bad records were in the file. The context also gives you a way of telling the framework that you're actually progressing. What Hadoop does is this: if your task is running for too long, it cannot really decide whether you are stuck in an endless loop or really doing work. So you can set up a timeout after which the task is killed; if the framework doesn't see progress updates for that many minutes, seconds, whatever, it kills the task. Which of course is bad if you have a long-running computation, so you want to tell the framework: hello, I'm still alive, I'm doing sensible work, please don't kill me.

The reduce function then simply sums up the values and outputs the result.

If you have a look at our little picture: for MapReduce we now have a special node called the job tracker. That is the node we connect to in order to submit MapReduce jobs. And on each slave node we have task trackers that really run the map tasks and reduce tasks. What does a MapReduce job really look like? I write my client application on my client node; this application contacts the job tracker and tells it: hey, I've got a job for you to run. The job tracker then has a look at where the data is really located and contacts those machines. On each machine there is a task tracker, and this task tracker is responsible for local scheduling. What the task tracker does is start a JVM on its node, run the map task or the reduce task in that JVM, and return its output.

Why does it run in a separate JVM? Well, as I told you, the framework should be robust, and being robust means that the client jobs that are run shouldn't crash the whole framework; they shouldn't even crash a task tracker. If I have a client job that crashes its JVM, I want that to be a separate JVM. This also means that if I run MapReduce jobs that are very, very short, there is quite a bit of overhead. But the assumption is that I really have so much data on my node that the overhead of starting another JVM doesn't count for much. And of course, again, you have not only one task tracker but multiple of them.
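To pull the pieces together, here is a minimal sketch of the host-name example against the new API just described. It uses the standard line-based input format, so the map input is a byte offset and one line of text rather than a whole file; the class names are made up for the example, and the regular expression comes with the same "no guarantees" caveat as the one on the slide:

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HostnameCount {

  public static class HostMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Simplified host-name pattern; no guarantee it matches every legal host.
    private static final Pattern HOST = Pattern.compile("http://([^/\"]+)");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      Matcher m = HOST.matcher(line.toString());
      while (m.find()) {
        context.write(new Text(m.group(1)), ONE); // emit (host, 1)
      }
      // The same context carries the counter and progress calls from above:
      // context.getCounter("quality", "bad records").increment(1);
      // context.progress(); // "I'm still alive, don't kill me"
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text host, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) { // all values for one host, guaranteed
        sum += c.get();
      }
      context.write(host, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "hostname count");
    job.setJarByClass(HostnameCount.class);
    job.setMapperClass(HostMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```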
So if you were to start your own Hadoop hacking, what do you need? In terms of software it's easy: you go to the Apache Hadoop website and download the Hadoop distribution, or you go to Cloudera and take their Hadoop distribution, or you go to Debian. People are packaging Hadoop for Debian at the moment, so they have Hadoop in the pipeline, and it may not be long before you can just type apt-get install hadoop and get your system up and running.

Well, what do you need in terms of hardware? This is probably the dream of everyone, right? Running in a data center on thousands of nodes, just happily hacking along. If you don't have a data center, you may use other people's data centers. Anyone not aware of EC2? Okay. So you just rent machine time from Amazon, and you can get your own Hadoop cluster up and running on EC2. There are a lot of how-tos, and there are also AMIs that make that very easy if you just want to play around. There is also MapReduce as a service at Amazon; it doesn't use HDFS but its own backend, but it's pretty nice for getting started and playing around with MapReduce in the first place. However, be careful, because Elastic MapReduce uses the old MapReduce API, not yet the new one.

If you don't want to go into cloud computing, you can set up your own little Hadoop cluster. Thanks to Thilo for installing our Hadoop cluster, and thanks to packet and mask from the CCC Berlin for providing some of the hardware. But you really don't need large servers. Actually, you can start playing around with just a tiny little laptop like the one over here. It can run Hadoop in single-node standalone mode and in pseudo-distributed mode, and you can get your programs running locally. This is also very handy if you want to debug programs or develop new stuff: you probably want to try it out on your development machine. You don't want to go to your cluster, fire up the debugger and debug in a distributed way; you want to do this locally on your machine, probably even inside the IDE.

So what is up next? In the 0.21 release there will finally be append in HDFS. So far you can write a file and close it, and that's it; the goal is to make it possible to open files again and append to the end. It's a pretty long story, getting append into HDFS. And there will be more advanced task schedulers. In 0.22 there will be more security. Currently there are user rights of a sort, but there is no real, hard security on the data in the cluster; the basic assumption is that it runs in your data center and the data is safe there. There will be Avro-based RPC, so that RPC is compatible across Hadoop versions. Anyone ever heard of Avro? Hey, great. Avro basically is an RPC and serialization library that is a subproject of Hadoop, written by Doug Cutting. There will be symbolic links, and there will be federated name nodes: no more single name node.
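Coming back to local debugging for a moment: a minimal sketch of the configuration one might use to force a job into a single local JVM. The property names are the 0.20-era ones (newer releases renamed them), and the helper class is made up for the example:

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: tweak a job configuration so everything runs in
// the local JVM against the local file system, which is handy for
// stepping through map and reduce code inside the IDE.
public final class LocalDebugConf {
  public static Configuration create() {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "local"); // LocalJobRunner, no cluster
    conf.set("fs.default.name", "file:///"); // local FS instead of HDFS
    return conf;
  }
}
```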
So who's using Hadoop? We may hear a little bit about that in the next talk, by Facebook. Hadoop is used by Yahoo for lots of analysis. It's used by Last.fm for coming up with recommendations. It's used by the New York Times for scaling and converting images. It's used by search engines like Deepdyve for text analysis, and it's used by several search engines that are based on Nutch or on Katta.

If you have a look at Hadoop on a broader scale, there's not only the core project. Several people are working on other projects that make it easier to handle Hadoop, that make it easier to write MapReduce jobs, or that make it better for data storage.

Let's first have a look at higher-level languages. Some of you may know some of the logos. Just to motivate why you need higher-level languages, I'll give you an example. I'm taking the example from Pig, because I saw a presentation on Pig one year ago, and they had a very, very great example motivating why you need something like a higher-level language. Suppose you have some user data in one file and website data in another file, and you want to find the top five most visited pages by users aged 15 to 25. Sounds like a pretty reasonable task, doesn't it? So you need something to load your users and load your pages; you need something to filter the users by age; you want to join both on name; you want to group on the URL visited; you want to count the clicks on each URL; you want to order by click count; and you want to take the top five. If you were to do this in Java code, it would look something like this. You're not supposed to be able to read it up there in the last rows; you're not even supposed to be able to read it over here in the front. If you're doing it with Pig, it looks something like this, and I hope that even the guys in the back can read it. So what you can do with it is write one-off jobs for data analysis pretty easily. Of course you pay some overhead when running these jobs, but then again you don't have to write loads and loads of Java code, and it's easier to understand, of course.

Then there are some projects making it easier to distribute storage. As I explained, HDFS is not optimized for random access; it's optimized for continuously reading files. So what if you want some kind of random access, and what if you have semi-structured data? Then you may want to go for HBase, which is built on top of HDFS, or for Cassandra or Hypertable, which address the same need.
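To give a feeling for what random access means here, a small sketch against the classic HBase client API. The table name, column family and row key are made up for the example, and the exact client classes vary a bit between HBase versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // A "pages" table with a "stats" column family is assumed to exist.
    HTable table = new HTable(conf, "pages");

    // Random-access write: update a single cell for one row key.
    Put put = new Put(Bytes.toBytes("example.com/index.html"));
    put.add(Bytes.toBytes("stats"), Bytes.toBytes("clicks"), Bytes.toBytes(42L));
    table.put(put);

    // Random-access read: fetch that one row back by key, the kind of
    // lookup that plain HDFS files are not designed for.
    Result row = table.get(new Get(Bytes.toBytes("example.com/index.html")));
    long clicks = Bytes.toLong(
        row.getValue(Bytes.toBytes("stats"), Bytes.toBytes("clicks")));
    System.out.println("clicks: " + clicks);
    table.close();
  }
}
```

Each put and get addresses a single row by key, which is exactly the access pattern streaming-oriented HDFS files are not built for.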
There are also a few libraries built on top of Hadoop that make handling your data easier. You may store your files in a plain text format, but that obviously isn't very efficient: neither space-efficient nor time-efficient in terms of parsing time. What you want is some sort of binary format, but a binary format that you can upgrade easily. That's where something like Avro, Thrift or Google's protocol buffers comes into play.

Then a lot of innovation happens around Hadoop. As I mentioned already, there's a project building machine learning algorithms on top of Hadoop. We have clustering algorithms for grouping items by similarity. We have classification algorithms for classifying new incoming items into predefined categories; a well-known example is spam email classification. You have two categories, mails that I want to keep and mails that I want to throw away, and I want to learn a classifier that separates data points into these classes. That's what you can do with Mahout. And of course you can do recommendation mining: if you go to Amazon and buy a book, Amazon usually tells you that people who bought this book also bought such-and-such books. That's something you can implement with Mahout. And of course there are also search engines making use of Hadoop to distribute indexing.

So, I've done a lot of advertisement for Hadoop; just three final slides of advertisement. Why should you go with the project? First of all, it's proven code: it works in practice, it works in production, and you don't need to reinvent the wheel. Next, it's an Apache project, and there are mailing lists that are very lively, with very lively discussions and people who are willing to help you. So come to the mailing lists if you are a Hadoop user and provide input. By the way, if you're a Mahout user, the same applies to you. Last but not least, it's very well possible to become part of the community. If you have a look at the graph here at the bottom, that's just the growth of emails on the Hadoop mailing lists. The community is still growing steadily, and there are more and more people, from inside as well as from outside Yahoo, working on Hadoop full-time or in their free time. So, final advertisement: if you're using Hadoop, come to the mailing lists and talk to us.

One advertisement in my own interest: I'm doing the Hadoop Get Together in Berlin. The next one is on March 10th, if you need an excuse to visit Berlin. There will be three talks, it starts after 5 p.m., there will be beer after the event, and there will be lots of interesting discussions and lots of interesting people there. And there's another Hadoop event in Berlin; it looks like this year is the Hadoop year for Berlin. There will be the Berlin Buzzwords event on June 6th and 7th. Talks on the topics of storing data, searching, indexing and scaling to large amounts of data are very welcome. We welcome talks on Apache Hadoop, on NoSQL databases, on distributed computing, on business intelligence applications, on search applications, on scaling search indexes, and on cloud computing in general. You already noticed, while I was telling you all those names and words and tokens, that it should be easy to play buzzword bingo at that conference. So that's my final invitation to come to the mailing lists. Thank you.

Hi, I was wondering what is being done to make HDFS mountable as a regular file system.

How can you mount HDFS as a regular file system? Does that work yet? It works with FUSE; you can mount it with FUSE.

Okay, and does that work nicely, or is it like...?

For me it has always worked, but you should be aware that it's not a POSIX file system, so it doesn't support all the POSIX semantics.

Hello. Maybe I missed this before, but you mentioned that Amazon Elastic MapReduce uses the old MapReduce API. So what's the difference, and what's the advantage of the new one?

The new one is more compact. It isn't so verbose, and the signatures of your map and reduce functions have changed a little. So it's fairly easy to port your jobs; it's not a very fundamental change, but you do have to adapt to the new API.

I'm trying to follow up on that first question; I'm afraid I didn't hear your answer to it all that clearly. What I'm wondering is: suppose I wanted to implement an stdio-like layer over it, and be able to just save to Hadoop from my word processor, my image processor, whatever I'm using. How far are we from being able to support that?
I'm guessing that is probably far away. Can someone pass the fancy microphone over?

From a POSIX point of view you're pretty far away, because you can't reopen files and then append. You can write once, close the file, and read it again, but append is only just coming. You also cannot insert into files. So it isn't really read-write; it doesn't work right now. If you're interested in working on that, maybe you should think about becoming part of the project rather than building a stdio layer on top of it.

Because it would be way more efficient, wouldn't it, to extend HDFS with a view to doing that. And I thought I'd ask the people who know before investing time in it.

Hello, hi, simple question. You said the system is designed with failure assumed, but there seems to be a key dependency on the name node. What happens if the name node fails?

You should plan for failover of the name node yourself, with your standard measures. Currently I'm not aware of anything in the framework itself that gives you failover for the name node, but it's possible to implement that and design it into your cluster.

You mentioned Amazon's Hadoop on demand. I'm guessing they have their own API for that, and I don't mean the actual Hadoop API, I mean their customer-facing API.

You can write regular MapReduce jobs with the Hadoop API and submit them, so there's no separate API; Amazon didn't implement its own job API. Basically, you can take Mahout, for instance, and run it on Elastic MapReduce; that was done about one year ago. The only thing you have to take care of is that the Amazon service is based on Hadoop 0.18, and that API differs from what is currently available.

There has got to be a web-facing API in front of that, right?

Yes, but that is Amazon-specific.

Are there plans for an open implementation of that web-facing API for Hadoop on demand? Is that of any interest?

I think it's currently not on the schedule, but you can ask on the mailing list for input on that.

Excuse me, yeah, I have a question. Does the Hadoop framework support backing up my data store in a consistent way and restoring it?

It does have backup built in, in terms of replication. Each file you store in Hadoop is replicated to three disks, and in case one of those hard disks fails, in case one of those nodes fails, replication is started again to get back up to the target level.

I mean taking a backup of, say, the state at twelve o'clock today, and being able to restore it completely.

If you really want to do backups of petabytes of data: do you have that much storage available?

For our applications there are legal requirements to be able to restore to a certain point in time. We looked at Hadoop and didn't find a way to do a point-in-time backup and restore.

You can always read the data out of the cluster; no one is stopping you from doing that.

It's not that I can't freeze a single machine at a certain point in time, but I didn't find a way to freeze the entire cluster.

I think this is up to you to implement.

Okay, so this has to be done in the application?

This is a job for the application and not for the cluster, by its design.

Thanks a lot.

So my question is about the fact that Hadoop scales to multiple nodes, but how does it handle multiple processors on each node?
Of course tasks are distributed across the processors of a node. But currently I'm not aware of any optimization of the sort "the data this reducer outputs should be post-processed on the same node, because the node has multiple processors and can do it in parallel"; as far as I know that optimization isn't in there. But of course the framework makes use of multiple cores if you have multiple cores available on each node; that works perfectly well.

In which language is Hadoop written, Java or C?

It's written in Java, but there are bindings to write C++ programs as well, and there's a streaming API, so you can use any scripting language of your choice. If you really want to, you can write your data analysis jobs in Python, PHP, whatever you want.

So if I want to run it on an embedded system, it won't be possible? A small router that has Hadoop inside, with, let's say, 32 megabytes of RAM?

That won't be possible. Sounds like a crazy idea.

Are there any more questions? Please go ahead.

If you just use Hadoop as a file system, do you have any comparative benchmarks with Lustre, the Lustre file system?

There should be benchmarks out there, but I don't have any in my head; you can probably look them up. If there are any more questions, feel free to contact me after the talk; I'm happy to answer any questions. And the guy over there who raised his hand earlier: please come to me and talk to me, I want to know more about your use case. Any more questions?

Could you tell me what the memory requirements are for the HDFS nodes?

That depends on how much data you want to store in your cluster, and it depends on the size of your blocks. I don't have exact numbers.

And generally, is it a memory-intensive task? Well, okay, you said that for the name server it's required to have a large amount of RAM, but what about the nodes which contain the data?

You mean the data nodes, not the name node? The name node is a single entity that needs a lot of RAM. The nodes which contain the data don't need that; they are far less demanding on main memory than the name node.

So we still have some time left. If you have any more questions, take your chance.

Just an announcement, because I saw somebody searching Cloudera for Debian packages of Hadoop: Hadoop is in the pipe to enter Debian unstable in the next weeks. So if you want a Hadoop package in Debian mainline, that's the guy to talk to; could you stand up again? People really should help this guy, because I would like to have a Debian package, to be able to type apt-get install hadoop on my regular Debian machine, without Cloudera repositories, without external repositories, just from Debian main. Could you just give him the mic?

I need somebody to help me out with the C stuff and all the bindings. I've done the Java stuff; that is packaged.