Yeah, hopefully interesting. So hello, welcome everyone. I hope you're enjoying EuroPython as much as I do. For the next 45 minutes, you can just sit, relax, and enjoy this talk about big data with Python and Hadoop. The slides are already on slideshare.net, and I'll give you the link at the end of the talk.

This is our agenda for today. First, a quick introduction about me and my company, so you get an idea of what we use Hadoop for. Then a few words about big data, Apache Hadoop, and its ecosystem. Next, we'll talk about HDFS and third-party tools that can help us work with it. After that, we'll briefly discuss MapReduce concepts and talk about how we can use Python with Hadoop: what options we have, what third-party libraries are out there (written in Python, of course), and their pros and cons. Next, we'll briefly discuss a thing called Pig. And finally, we'll see benchmarks of all the things we've talked about. These are freshly baked benchmarks, which I made a week ago just before coming to EuroPython, and they are actually quite interesting. And of course, conclusions.

By the way, can you please raise your hands: who knows what Hadoop is, is working with Hadoop, or maybe worked with Hadoop in the past? OK. Thanks. Not too many.

All right, this is me. My name is Max. I live in Moscow, Russia. I'm the author of several Python libraries; there's a link to my GitHub page if you're interested. I also give talks at different conferences from time to time and contribute to other Python libraries. I work for a company called Adata. We collect and process online and offline user data to get an idea of users' interests, intentions, demography, and so on. In general, we process more than 70 million users per day. There are more than 2,000 segments in our database, like users who are interested in buying a BMW, users who like dogs, or maybe users who watch porn online. We have partners like Google DBM, Turn, AppNexus, and many more. We have quite big worldwide user coverage, and we process data for more than 1 billion unique users in total. We have some of the biggest user coverage in Russia and Eastern Europe; for Russia, it's about 75% of all users. Having said all that, you can see that we have a lot of data to process, and we consider ourselves a data-driven company, or a big data company, as some people like to call it now.

What exactly is big data? There is actually a great quote by Dan Ariely about big data: "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." Nowadays, big data is mostly a marketing term or a buzzword. There is even a tendency to argue about how much data counts as big data, and different people say different things. In reality, of course, only a few have really big data, like Google or CERN. But to keep it simple for the rest of us: data can probably be considered big if it doesn't fit into one machine, can't be processed by one machine, or takes too much time to process on one machine. The last point, though, can also be a sign of big problems in your code rather than of big data.

Now that we've figured out that we probably have a big data problem, we need to solve it somehow. This is where Apache Hadoop comes into play. Apache Hadoop is a framework for distributed processing of large data sets across clusters of computers.
It's often used for batch processing, and this is the use case where it really shines. It provides linear scalability, which means that if you have twice as many machines, jobs will run twice as fast, and if you have twice as much data, jobs will run twice as slow. It doesn't require super cool, expensive hardware; it is designed to work on unreliable machines that are expected to fail frequently. And it doesn't expect you to have knowledge of inter-process communication, threading, RPC, network programming, and so on, because parallel execution across the whole cluster is handled for you transparently.

Hadoop has a giant ecosystem, which includes a lot of projects designed to solve different kinds of problems. Some of them are listed on this slide; more just didn't fit in. HDFS and MapReduce are actually not part of the ecosystem but part of Hadoop itself, and we'll talk about them on the next slides. We'll also discuss Pig, which is a high-level language for parallel data processing with Apache Hadoop. I won't talk about the others, because we simply don't have time; if you're interested, you can Google them yourself.

So, HDFS. It stands for Hadoop Distributed File System. It just stores files and folders. It chunks files into blocks, and the blocks are scattered all over the cluster. By default, a block is 64 megabytes, but this is configurable. It also provides replication of blocks: by default, three replicas of each block are created, but this is also configurable. HDFS doesn't allow you to edit files, only create, read, and delete them, because it is very hard to implement editing in a distributed system with replication. So what they did was simple: why bother implementing file editing when we can just make files non-editable?

Hadoop provides a command line interface to HDFS, but the downside is that it's implemented in Java, and it needs to spin up a JVM, which takes from 1 to 3 seconds before a command can be executed. That's a real pain, especially if you're trying to write scripts around it. But thanks to the great guys at Spotify, there is an alternative called Snakebite. It's an HDFS client written in pure Python. It can be used as a library in your Python scripts or as a command line client. It communicates with Hadoop via RPC, which makes it amazingly fast, much faster than the native Hadoop command line interface. And finally, it's a little less to type to execute a command, so Python for the win. There is one problem, though: Snakebite doesn't handle write operations at the moment. So while you can do metadata operations like moving or renaming files, you can't write a file to HDFS using Snakebite. But it is still under very active development, so I'm sure this will be implemented at some point.

Using Snakebite as a library in your Python scripts is really amazingly simple: you just import the client, connect to Hadoop, and start working with HDFS. I'll show a small sketch below.

There is also a thing called Hue. Hue is a web interface for analyzing data with Hadoop. It provides an awesome HDFS file browser; this is how it looks. You can do everything through Hue that you can do through the native HDFS command line interface. It also has a job browser and a job designer, so you can develop Pig scripts and Impala and Hive queries, and a lot more. It supports ZooKeeper, Oozie, and many more. I won't go into details about Hue because, again, we don't have time for this.
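As promised, here is a minimal sketch of Snakebite used as a library. The namenode host, port, and paths are placeholders for your own cluster settings:

```python
# A minimal Snakebite sketch; host, port, and paths are placeholders.
# Note: Snakebite is read-only for file contents -- metadata operations
# like mkdir/rename work, but writes do not.
from snakebite.client import Client

client = Client('namenode.example.com', 8020)

# ls() takes a list of paths and lazily yields one dict per entry
for entry in client.ls(['/user/max']):
    print(entry['path'])

# Metadata operations return lazy generators too, so wrap them in
# list() to actually execute them
print(list(client.mkdir(['/user/max/new_dir'])))
```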
If you don't use Hue yet, try it; it's a tool you'll love. And by the way, it's built on top of Python and Django, so again, Python for the win.

Now that we know how Hadoop stores its data, we can talk about MapReduce. It's a pretty simple concept. There are mappers and reducers, and you have to code both of them, because they do the actual data processing. What mappers basically do is load data from HDFS, transform, filter, or prepare that data somehow, and output pairs of key and value. The mappers' output then goes to the reducers, but before that, some magic happens inside Hadoop, and the mappers' output is grouped by key. This allows you to do things like aggregation, counting, searching, and so on in the reducer. So what you get in the reducer is a key and all the values for that key. And after all reducers are complete, the output is written to HDFS.

Actually, the workflow between mappers and reducers is a little more complicated: there is also a shuffle phase, a sort, sometimes a secondary sort, and there are combiners, partitioners, and a lot of other stuff. But we won't discuss that at the moment; it doesn't matter for us. It's perfectly fine to assume there are just mappers and reducers, with some magic happening between them.

Now let's have a look at an example of MapReduce. We'll use the canonical word count example that everybody uses. We have a text used as input, which consists of three lines: "Python is cool", "Hadoop is cool", and "Java is bad". This text will be processed line by line, and inside the mapper each line will be split into words, like this. For each word, the map function will emit the word and the digit one. It doesn't matter whether we've already seen this word once or twice or three times; we just output the word and a one each time. Then some magic happens, provided by Hadoop, and inside the reducer we get all the values for a word, grouped by that word. So we just need to sum up these values in the reducer to get the desired output. This may seem unintuitive or complicated at first, but it's actually perfectly fine: when you're just starting with MapReduce, you have to make your brain think in terms of MapReduce, and after you get used to it, it all becomes very clear. So this is the final result of our job.

Now let's have a look at how our word count example looks in Java. You probably now understand why you earn so much money when you code in Java: more typing means more money. And can you imagine how much code you'd have to write for a real-world use case in Java?

So, after you've been impressed by the simplicity of Java, let's talk about how we can use Python with Hadoop. Hadoop doesn't provide a way to work with Python natively. It uses a thing called Hadoop streaming. The idea behind streaming is that you can supply any executable to Hadoop as a mapper or reducer. It can be a standard Unix tool like cat or uniq, or Python scripts, Perl scripts, Ruby, PHP, whatever you like. The executable must read from standard input and write to standard output. This is the code for the mapper and reducer (there's a sketch of both below). The mapper is actually very simple: we just read from standard input line by line, split each line into words, and output each word and the digit one, using a tab as the separator, because that's the default Hadoop separator. You can change it if you like.
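Here is a minimal sketch of such a streaming pair: the mapper, and the reducer we'll pick apart next. The tab-separated key/value format is Hadoop streaming's default:

```python
#!/usr/bin/env python
# mapper.py: read lines from stdin, emit "word<TAB>1" pairs to stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: input arrives sorted by key but NOT grouped, so we find
# the key boundaries ourselves with itertools.groupby
import sys
from itertools import groupby
from operator import itemgetter

pairs = (line.rstrip('\n').split('\t', 1) for line in sys.stdin)
for word, group in groupby(pairs, key=itemgetter(0)):
    # every pair in the group still carries the word, so drop it
    total = sum(int(count) for _, count in group)
    print('%s\t%d' % (word, total))
```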
So one of the disadvantages of using streaming directly is the input to the reducer: it's not grouped by key, it just comes line by line, so you have to figure out the boundaries between keys yourself. And this is exactly what we do here in the reducer. We use groupby: it groups multiple word/count pairs by word and creates an iterator that returns consecutive keys along with their groups. Inside each group, the first item of every pair is again the key, so we just discard it with an underscore, cast the count to an int, and sum everything up. It's pretty awesome compared to how much you have to type in Java, but it's still a little more complicated than it could be, because of the manual work in the reducer.

This is the command that sends our MapReduce job to Hadoop via Hadoop streaming. We need to specify the Hadoop streaming JAR, the paths to the mapper and reducer using the -mapper and -reducer arguments, and the input and output paths. One interesting thing here is the two -file arguments, where we specify the paths to the mapper and reducer again. We do that to tell Hadoop that we want these two files uploaded to the whole cluster, into what's called the Hadoop Distributed Cache: the place where Hadoop stores all the files and resources needed to run a job. This is a really cool thing, because imagine you have a small cluster of four machines, you've just written a pretty cool job, and your script uses an external library which, obviously, is not installed on your cluster. With four machines, you can log into every machine and install the library by hand. But what if you have a big cluster of 100 or 1,000 machines? That just won't work anymore. Of course, you could create some bash script that does the automation for you, but why do that if Hadoop already provides a way? You just specify what you want Hadoop to copy to the whole cluster before the job starts, and that's it. And after the job is completed, Hadoop deletes everything, and your cluster is back in its initial state. It's pretty cool. And after our job is complete, we get the desired results.

So Hadoop streaming is really cool, but it requires a bit of extra work, and though it's still much simpler than Java, we can simplify it even more with the help of different Python frameworks for working with Hadoop. Let's do a quick overview of them. The first one is Dumbo. It was one of the earliest Python frameworks for Hadoop, but for some reason it's not developed anymore: no support, no downloads, so let's just forget about it. Then there is Hadoopy. It's the same situation as with Dumbo: the project seems to be abandoned, but there are still some people trying to use it, according to PyPI downloads. So you can try it if you want; I don't. Then there is PyDoop. It's a very interesting project: while the other projects are just wrappers around Hadoop streaming, PyDoop uses a thing called Hadoop Pipes, which is basically a C++ API to Hadoop, and that makes it really fast, as we'll see. There's also the Luigi project, which is also very cool; it was developed and is maintained by Spotify. Its distinguishing feature is the ability to build complex pipelines of jobs, and it supports many different technologies for running those jobs. And finally, there is MRJob, the most popular Python framework for working with Hadoop; as a preview, there's a sketch of our word count written with it below.
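Here's that preview: a minimal MRJob word count sketch, closely following the shape of MRJob's canonical example:

```python
# wordcount_mrjob.py -- a minimal MRJob word count sketch
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # for plain text input the key is None, so we ignore it
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # counts is an iterator over all values emitted for this word
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```

You can run this locally with `python wordcount_mrjob.py input.txt`, or on a cluster with the `-r hadoop` switch.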
MRJob was developed by Yelp, and it's also cool, but there are some things to keep in mind while working with it. So we'll talk about MRJob, Luigi, and PyDoop in more detail on the next slides.

The most popular framework is called MRJob, or "Mr. Job", as some people like to call it; I like that too. MRJob is a wrapper around Hadoop streaming, and it is actively developed, maintained, and used by Yelp. The sketch above shows how our word count example can be written using MRJob; it's even more compact. While the mapper looks absolutely the same as in raw Hadoop streaming, just notice how much typing we saved in the reducer. Behind the scenes, MRJob is doing the same groupby aggregation we just saw in the Hadoop streaming example.

But as I said, there are some things to keep in mind. MRJob uses so-called protocols for data serialization and deserialization between phases. By default, it uses a JSON protocol based on Python's built-in json library, which is kind of slow, so the first thing you should do is install simplejson, because it's faster. And starting from MRJob 0.5.0, which I think is still in development, it supports the UltraJSON (ujson) library, which is even faster. This is how you can specify the UltraJSON protocol; again, it's only available starting from 0.5.0, and lower versions use simplejson, which is slower. MRJob also supports a raw protocol, which is the fastest protocol available, but then you have to take care of serialization and deserialization yourself, as shown on this slide: notice how we cast the one to a string in the mapper and the sum to a string in the reducer. With the introduction of UltraJSON in the next version of MRJob, though, I don't think there's much need for these raw protocols, because they are actually not that much faster than UltraJSON, at least most of the time. Of course, it depends on the job, so experiment for yourself and see what fits best for you.

MRJob pros and cons. In my opinion, it has the best documentation of all the Python frameworks. It has the best integration with Amazon's EMR (Elastic MapReduce) of all the Python frameworks, which is understandable, because Yelp operates inside EMR. It has very active development and the biggest community. It provides really cool local testing without Hadoop, which is very convenient during development, and it automatically uploads itself to the cluster. It also supports multi-step jobs, which means one step starts only after the previous one has successfully finished, and you can also use shell commands or JAR files or whatever in such a multi-step workflow. The only downside I can think of is slow serialization and deserialization compared to raw Python streaming, but given how much typing it saves you, we can probably forgive it for that. So this is not really a big con.

Next on our list is Luigi. Luigi is also a wrapper around Hadoop streaming, and it is developed by Spotify. The sketch below shows how our word count example can be written using Luigi. It is a little more verbose than MRJob, because Luigi concentrates mainly on the whole workflow, not just a single job, and it also forces you to define your input and output inside a class rather than on the command line. As for the mapper and reducer implementations, they are absolutely the same. Four minutes left? I have so much to say.
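Here is a sketch of that Luigi version. I'm using the luigi.contrib.hadoop module layout (older Luigi releases had luigi.hadoop instead), and the HDFS paths are placeholders:

```python
# wordcount_luigi.py -- a Luigi word count sketch; HDFS paths are
# placeholders, and the module layout varies between Luigi versions
import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs


class InputText(luigi.ExternalTask):
    """The pre-existing input file on HDFS."""

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/data/input.txt')


class WordCount(luigi.contrib.hadoop.JobTask):
    # input and output live inside the class, not on the command line

    def requires(self):
        return InputText()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/data/wordcount_output')

    def mapper(self, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    luigi.run(main_task_cls=WordCount)
```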
Okay, so Luigi also has this serialization problem, and here too you should just use UltraJSON, and everything will be cool. We'll probably skip the rest: Luigi is cool, but it's not so good for local testing, and we'll skip PyDoop for now too. All right, benchmarks.

This is the most important part; it's probably what a lot of people are here for. This is the cluster and software I used for the benchmarks. The job was a simple word count over the well-known book about Python by Mark Lutz, which I multiplied 10,000 times, giving me 35 gigabytes of data. I also used a combiner between the map and reduce phases. A combiner is basically a local reducer that runs right after the map phase; it's a kind of optimization (there's a sketch of one at the end of this part).

So this is it, this is the table. Java is the fastest, of course; no surprise here, so it is used as the performance baseline. All numbers for the other frameworks are ratios relative to the Java values. For example, the job runtime for Java is 187 seconds, which is a bit over three minutes. To get the number for PyDoop, you multiply 187 by 1.86, which gives you about 348 seconds, almost six minutes. I ran each job three times and took the best time.

Let's discuss a few things about this performance comparison. PyDoop is second after Java because it uses the Hadoop Pipes C++ API; it's still almost twice as slow as native Java. Another thing that may seem strange is the 5.97 ratio in the reduce input records: it looks like the combiner didn't run. But there is an explanation in the PyDoop manual. It says the following: the current Hadoop Pipes architecture runs the combiner under the hood of the executable run by Pipes, so it doesn't update the combiner counters of the general Hadoop framework. This is why we see that number.

Then comes Pig. I actually thought Pig would be second after Java before I ran these benchmarks, but unfortunately I didn't really have time to investigate, so I just can't say why it is slower; Pig translates itself into Java, so it should be almost as fast as Java.

Then comes raw streaming, under CPython and PyPy. (Any questions so far, or shall I just continue? No questions; okay.) You may be a bit surprised that PyPy is slower, but the thing is that word count is a really simple job, and PyPy is currently slower than CPython when dealing with reading and writing from standard input and standard output. So it really depends on the job: in real-world use cases, PyPy is actually a lot faster than CPython. What we usually do is implement a job, run it on both PyPy and CPython, and see the difference; like I said, in most cases PyPy wins. So just try it for yourself and see what fits best for you.

Then comes MRJob, and as you can see, UltraJSON is just a little slower than the raw protocols, but it saves you the pain of manual serialization work, so I'd say just use UltraJSON. And finally Luigi, which even with UltraJSON is much, much slower than MRJob.
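As promised, here's the combiner sketch. In MRJob a combiner is just one more method, and for word count it can reuse the reducer logic, because summing is associative and commutative:

```python
# A word count with a combiner, sketched in MRJob
from mrjob.job import MRJob


class MRWordCountCombined(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # a "local reducer": pre-aggregates counts on each map node,
        # so far less data is shuffled across the network
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCountCombined.run()
```

With raw streaming you'd get the same effect by passing your reducer script to the -combiner option.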
And I don't even want to talk about Luigi's terrible performance with its default serialization scheme. Okay, since we still have a little time, I can go back a bit. As we just saw, Luigi by default uses a serialization scheme which is really, really slow. This is how you can switch it to JSON. I didn't really have time to investigate this either, but after switching to JSON I needed to specify an encoding by hand, so that's also something to keep in mind. And don't forget to install UltraJSON, because by default Luigi falls back to the standard library's json, which is slow.

Okay, Luigi pros and cons. Luigi is the only framework that really concentrates on the workflow as a whole. It provides a central scheduler, which has a nice dependency graph of the whole workflow, and it records all the tasks and all the history, so it can be really useful. It is also under very active development and has a big community; not as big as MRJob's, but still very big. It also automatically uploads itself to the cluster, and it is the only framework with Snakebite integration, which is awesome, just believe me. On the downside, it provides not-so-good local testing compared to MRJob, because you need to mimic the map and reduce functions yourself in the run method, which is not very convenient. And it has the worst serialization and deserialization performance, even with UltraJSON.

The last of the Python frameworks I want to talk about is PyDoop. Unlike the others, it doesn't wrap Hadoop streaming but uses Hadoop Pipes. It is developed by CRS4, which is a center for advanced studies, research and development in Sardinia, Italy. The word count example in PyDoop looks very similar to MRJob (there's a sketch at the end of this part), but unlike MRJob or Luigi, you don't need to think about different serialization and deserialization schemes. You just concentrate on your mappers and reducers, on your code, and do your job. So it's cool.

Okay, pros and cons; I'll do my best. PyDoop has pretty good documentation; it could be better, but generally it's very good. Due to the use of Hadoop Pipes, it is amazingly fast. It also has active development. It provides an HDFS API based on the libhdfs library, which is cool because it is faster than the native Hadoop HDFS command line client, but still slower than Snakebite. I didn't benchmark this, but the Spotify guys claim it's slower because it still needs to spin up a JVM instance, so I can believe them that Snakebite is faster. PyDoop is also the only framework that gives you the ability to implement a record reader, a record writer, or a partitioner in pure Python. These are somewhat advanced PyDoop concepts, so we won't discuss them, but the ability to do that is really cool. The biggest con is that PyDoop is very difficult to install, because it is written in C++, Python, and Java, so you have to have all the needed dependencies in place, plus you need to set some environment variables correctly, and so on. I've seen a lot of posts on Stack Overflow and other sites where people just got stuck on the installation process, and probably because of that, PyDoop has a much smaller community. So the only place where you can ask for help is PyDoop's GitHub repository, but the authors are really very helpful; they're cool guys, and they answer all the questions. Also, PyDoop doesn't upload itself to the cluster like the other Python frameworks do, so you need to do this manually, and that's not a trivial process if you're just starting to work with PyDoop.
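Here is the PyDoop word count sketch I mentioned. It follows the pydoop.mapreduce API; the exact module layout differs between PyDoop versions, so treat this as a sketch:

```python
# wordcount_pydoop.py -- a PyDoop (Hadoop Pipes) word count sketch;
# the module layout differs between PyDoop versions
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes


class Mapper(api.Mapper):

    def map(self, context):
        # the context carries the input key/value and emit();
        # no serialization scheme to think about
        for word in context.value.split():
            context.emit(word, 1)


class Reducer(api.Reducer):

    def reduce(self, context):
        context.emit(context.key, sum(context.values))


def __main__():
    # Pipes looks for a __main__ entry point in the submitted script
    pipes.run_task(pipes.Factory(Mapper, Reducer))
```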
So that's PyDoop. Now, Pig. Pig is an Apache project, a high-level platform for analyzing data. It runs on top of Hadoop, but it's not limited to Hadoop. This is a word count example using Pig. It gets translated into map and reduce jobs behind the scenes for you, so you don't have to think about what your mapper is and what your reducer is; you just write your Pig scripts. Also, most of the time in real-world use cases, Pig is faster than Python, which is really cool. It's a very easy language, which you can learn in a day or two. It provides a lot of functions for working with data, filtering it, and so on. And the biggest thing is that you can extend Pig's functionality with Python, using Python UDFs. You can write them in CPython, which gives you access to C libraries like NumPy and SciPy, but that's slower, because the UDF runs as a separate process and sends and receives data via streaming. Or you can use Jython, which is much, much faster, because it compiles UDFs to Java and you never need to leave the JVM to execute a UDF; but then you don't have access to libraries like NumPy and SciPy.

This is an example of a Pig UDF for getting geo data from an IP address, using the well-known library from MaxMind. It may seem complicated at first, but it's not, actually. In the Jython part, we first import some stuff from Java and the library itself, then we instantiate the reader object and define the UDF, which is simple: it accepts an IP address as its only parameter and tries to get the country code and the city's geoname from the MaxMind database. It is also decorated with Pig's output schema decorator: you need to specify the output type of the UDF, because Pig is statically typed. We put this UDF into a file called geoip.py. As for the Pig part, we need to register the UDF first, and then we can simply use it, as shown here. It's a really simple concept once you get used to it. There is also a thing called embedded Pig, but we don't have time for that.
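Before we get to the conclusions, here is the general shape of that Jython UDF. I've left out the MaxMind-specific Java calls and put a stand-in body instead, so the lookup logic here is hypothetical; the real mechanism is the @outputSchema decorator (which Pig makes available to registered Jython scripts) plus the REGISTER statement on the Pig side:

```python
# geoip.py -- the general shape of a Jython UDF (simplified sketch).
# A real version would import the MaxMind reader from Java and query
# the GeoIP database here.

@outputSchema('geo:(country:chararray, city:chararray)')
def get_geo(ip):
    if ip is None:
        return None
    # stand-in logic: a real UDF would look the IP up in the MaxMind
    # database and return (country_code, city_geoname)
    return ('XX', 'Unknown')

# On the Pig side, roughly:
#   REGISTER 'geoip.py' USING jython AS geo;
#   B = FOREACH A GENERATE geo.get_geo(ip_field);
```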
We already saw the benchmarks, so: conclusions. For complex workflow organization, job chaining, and HDFS manipulation, use Luigi and Snakebite; this is the use case where they really shine. Snakebite is the fastest option out there for working with HDFS, though of course you'll have to fall back to the native Hadoop command line interface if you need to write something to HDFS. But just don't use Luigi for the actual MapReduce implementation, at least until its performance problems are fixed. For writing lightning-fast MapReduce jobs, if you aren't afraid of difficulties in the beginning, use PyDoop and Pig; these are the two fastest options out there apart from Java. The problem with Pig is that it's not Python, so you have to learn it; it's a new technology to learn, but it's worth it. And PyDoop, while it may be very difficult to start with because of the installation problems and so on, is the fastest Python option, and it gives you the ability to implement record readers and writers in Python, which is priceless. For development, local testing, or perfect Amazon EMR integration, use MRJob: it provides the best integration with EMR, and it also gives you the best local testing and development experience of all the Python frameworks.

In conclusion, I'd like to say that Python has really, really good integration with Hadoop; it provides us with great libraries to work with it. Well, the speed is not that great compared to Java, of course, but we love Python not for its speed, but for its simplicity, ease of use, and so on. So, today, if you are wondering what the most frequently used word in Mark Lutz's book Learning Python is, not counting things like prepositions and conjunctions: this word was used 3,979 times, and this word is, of course, Python. That's all I've got; as promised, the slides are on slideshare.net. Thank you!