So I think here we have with us for the first talk Jyoteshka. He is a committer on the Apache Spark project, and currently he works as a data engineer at DataWeave. The topic of the presentation is Python plus Spark: Lightning-Fast Cluster Computing. Please welcome Jyoteshka.

OK, thanks, guys. Thanks for coming here. Am I audible? Can you guys hear me? Thanks, guys, for coming. So this is my first PyCon and the first PyCon talk that I'm giving. The talk is on Python plus Spark, a way to do lightning-fast cluster computing. Also, at the end of the talk we'll have a five-minute Q&A session. Please raise your hands, one of the volunteers will reach out to you, and then you can ask your question.

Hi, everyone. Thanks for coming here. I am Jyoteshka. I work as a data engineer at a company called DataWeave, and I'm also a committer on the Spark project. So the topic of this talk is Python and Spark, which is a way of doing cluster computing at a lightning-fast pace. Spark is a big buzzword going around right now; some people call it the Hadoop killer. Before I go on, let's do a show of hands first. People who are familiar with Hadoop MapReduce, just raise your hands. All right, that's a good number. And people who have heard of Spark before coming to this presentation? Great.

So if I had to describe Spark in one sentence, it would be: an in-memory cluster computation framework for large-scale data processing. I have emphasized the words "in memory" because that's what makes it different from the traditional Hadoop MapReduce paradigm: Spark leverages the memory you have in your server or in your cluster, so you can load data into memory and do iterative processing there. With the way things are going, memory is becoming cheaper and cheaper. Servers now come standard with 16 GB of RAM, and if you have 10 of those servers in your cluster, that's quite a lot of RAM for doing processing. So the paradigm had to shift from the traditional batch processing that Hadoop currently does towards in-memory computation.

Some background about Spark: it was started in 2009 at UC Berkeley as an academic research project, but it gained traction and became more and more popular. It went into the Apache Incubator last year, graduated from the Incubator earlier this year, and is now one of the most popular Apache projects. When I made this slide there were around 290 contributors on GitHub, and last I checked there are 307. So within less than a year the number of contributors has gone really high, and it has become one of the most active Apache projects of all time.

Spark itself is developed in Scala, which is a functional programming language on the JVM, and it also comes with Java and Python APIs. Also, contrary to popular belief, Spark is not meant to replace your Hadoop. If you have a Hadoop cluster, you can just install Spark there; Spark can sit there quietly and help you with data processing. A lot of distributions right now come with Hadoop pre-installed, and popular distributions such as Cloudera, Hortonworks or MapR are now bundling Spark and the other software in the stack with their distributions.
So if you download the latest Cloudera (CDH) or Hortonworks distribution, you will get Spark and you can play with it. There has also been a lot of benchmarking on Spark. One of the benchmarks shows that it can be up to 100 times faster than traditional Hadoop MapReduce when processing in memory, and up to 10 times faster when working from disk.

So who is using Spark? This is a limited list; there is a full list on their website. You can see that all the big players are using Spark right now: Alibaba is there, Amazon, Yahoo, eBay. A lot of companies were experimenting with Spark last year and have now moved it into production. One of the companies, Databricks, was founded by the people who actually created Spark, and they are building a cloud solution on top of it.

Now, here are some misconceptions people have around Spark. The first is that you need to know Scala or Java to use Spark. Spark is built using Scala, but that doesn't mean you have to know Scala to use it. That's a misconception. There is a pretty good, pretty stable Python API available, and almost all the functions available for Scala or Java work in Python as well.

Another misconception is that there is not enough documentation or example code available to get started with Spark. That's not true either. Whatever examples are available for Scala or Java on the website or in the GitHub repository, the same examples are also available in Python; some of them I have put there myself. There are a lot of example codes, blog posts and videos available. There have been two Spark Summits, one in 2013 and one in 2014, with plenty of videos showing how people are using Spark in production. Some Indian companies are using Spark right now as well. So that one is also false.

One more misconception is that not all Spark features are available for Python. I'd say that's false, because we have the machine learning library MLlib, and we have full SQL support through what used to be Shark, which has been integrated into the Spark ecosystem and is now called Spark SQL. The one missing piece is Spark Streaming, for processing data that arrives in real time and doing real-time computation on it. That's available in Scala and Java right now; the current stable version is 1.1, and in 1.2 we are releasing Spark Streaming for Python. So that's a work in progress.

All right, so let's get started with PySpark. What's PySpark? PySpark is the Python API for Spark, written using Py4J. Py4J is a wrapper that can talk to the JVM, and PySpark provides an interactive shell for you. I'll show you in a demo that when you run PySpark you can either write a standalone program or process data from the shell itself. The good thing is that you can also integrate it with your IPython shell or IPython notebook and do interactive processing there. Also, writing code in PySpark tends to mean two to ten times less code than standalone programs: roughly two times less than the Scala program and ten times less than the Java program. It currently has full support for Spark SQL, which was previously known as Shark, and as I mentioned, Spark Streaming is coming in.
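To give a feel for the interactive use just described, here is a minimal sketch of a pyspark shell session. It assumes you have started the pyspark shell, where a SparkContext is already available as sc; the variable names are only illustrative.

    # In the pyspark shell, a SparkContext is already available as `sc`.
    nums = sc.parallelize([1, 2, 3, 4, 5])        # create an RDD from a Python list
    squares = nums.map(lambda x: x * x)           # transformation: nothing runs yet
    evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation
    print(evens.collect())                        # action: triggers the job, prints [4, 16]

The same lines work unchanged inside an IPython shell or notebook once it is wired up to PySpark.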
So who can benefit from PySpark? Spark has a lot of use cases; many people are building their data pipelines on Spark right now, with architectures like the Lambda Architecture, which is quite popular these days. Many people include Spark for real-time computation or batch processing and things like that. But for PySpark specifically, there is a certain class of people who can benefit greatly from it: data scientists. Data scientist is one of the buzzwords these days. PySpark has a very good machine learning library, MLlib, which can also leverage in-memory cluster computation and iterative data processing. There are statistics libraries available, and there are libraries for almost all the common ML tasks: classification, clustering, recommendation, regression and so on. Another good thing is that in your Python program you can combine Spark with existing packages like NumPy, Matplotlib or Pandas for data wrangling and for visualizing your data.

So let's say you have a 40 GB Wikipedia dataset that you want to process with sub-second queries. What you do is load the data into memory, do some processing, some data cleanup and data wrangling using Spark, and then use your traditional Matplotlib or Pandas to visualize the data. It's as simple as that.

Also, machine learning programs tend to be iterative in nature: you load the data and then run multiple iterations over it. That was difficult in the traditional MapReduce paradigm, as I'll show you later. But since the data is already in memory, you have cached it there, you can reuse it again and again without having to move it back and forth between disk and memory.

So this is the interesting part: how Spark differs from the traditional Hadoop MapReduce paradigm. This is a very common picture that you have probably seen; it's the MapReduce architecture. You load the data from disk, it goes into a mapper, then through a shuffle, then you reduce, and then you write whatever you have processed back to disk. In between you keep reading and writing from disk, because the idea behind MapReduce was that data should not come to the code, code should come to the data; that's why there is so much writing back and forth to disk. It's a programming paradigm for batch processing, fault tolerance is achieved by replicating the data across n data nodes, and if you want to use high-level frameworks like Pig or Hive to process the data, they run a separate MapReduce job for each query you run.

So what's different in Spark? The difference is that you load the data into RAM once and you keep it there until you are done with it. You don't have to write it back to disk between operations: map, filter, reduce, anything. You can keep it in memory as long as you like and process the data any way you want. And you might ask: let's say I have 16 GB of RAM and I have to process 100 GB of data, what happens then? Spark will fit whatever it can into memory, and the rest will be spilled to disk. Whenever it has to compute over the whole dataset, it will read that part back from disk and work like normal MapReduce.
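As a concrete illustration of the caching idea, here is a small sketch; the file path is hypothetical, and cache() simply marks the RDD to be kept in memory after it is first computed.

    from pyspark import SparkContext

    sc = SparkContext(appName="caching-demo")

    # Load a (possibly large) text file once and mark it to be kept in memory.
    data = sc.textFile("hdfs:///data/wikipedia_dump.txt").cache()

    # Several passes over the same data reuse the cached partitions
    # instead of re-reading the file from disk every time.
    total_lines = data.count()
    long_lines = data.filter(lambda line: len(line) > 200).count()
    longest = data.map(lambda line: len(line)).reduce(max)

    print("%d lines, %d long lines, longest is %d chars" % (total_lines, long_lines, longest))
    sc.stop()

If the cached data does not fit in memory, Spark falls back to recomputing or re-reading the remaining partitions, along the lines described above.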
Also, like I mentioned, it can be integrated with the IPython shell, the IPython notebook or the standard Python shell, so you can do interactive processing. In your shell you can process data and see the output right there, put it into some other collection, run another computation on it; it becomes interactive. The dataset is represented as an RDD, which is the data abstraction in Spark; I'll explain it. The RDD also takes care of fault tolerance using something called a lineage graph.

So what's an RDD? RDD stands for Resilient Distributed Dataset. It's a read-only collection of objects partitioned across a set of machines. When you load your data into memory within a Spark context, it becomes an RDD. An RDD is automatically split into multiple partitions, and if your cluster has more than one node, the partitions are sent to multiple nodes for parallel processing. The good thing about an RDD is that since Spark knows how it was computed, it can rebuild it: if one chunk of data is lost, say one node goes down, the data can be recomputed without losing any information. RDDs can also be cached in memory; by default they aren't, so when Spark is done with one it discards it from memory to save space, but if you want it to persist in memory you can cache it. And you can run multiple MapReduce-style parallel operations on them. I'll also explain why RDDs are called lazy: once you load an RDD, it doesn't do any computation, it just stays there. Computation starts only once you apply some transformation or action to it; otherwise it just sits there.

Here is an example. You load one text file from HDFS; this sc is the SparkContext, and the result gets stored in lines, so lines becomes an RDD. You can do multiple operations on it: a flatMap, then a map, then a sortByKey, and you get the sorted counts. So what happens in the diagram below? You have an HDFS file; you apply the flatMap function and it becomes a flat-mapped RDD; you apply a map and it becomes a mapped RDD; you do sortByKey and it becomes a sorted RDD. Now let's say the mapped RDD goes away, because the node containing it goes down. What happens? Spark knows how the mapped RDD was computed: it was computed by applying that lambda function to the flat-mapped RDD. So Spark can recompute the data without you having to reload it, and without using replication. You can't really replicate in memory: memory is limited, and it's also more expensive than disk; you can't keep adding terabytes of memory to your cluster. So Spark doesn't replicate, but it has the capability to recompute the data.

So what operations can you perform on an RDD? There are two kinds: transformations and actions. A transformation means one RDD gets transformed into another RDD; these are operations like map, filter, sort or flatMap. An action means you are done with your RDD and you want to write it back to disk or get some data out of it. Some of the actions are reduce, count, collect, or saving to your local disk. I'll show these examples one by one.
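Here is a rough PySpark sketch of the chain from that slide; the file path and variable names are illustrative, and the point is only that each step records how it was derived from the previous one.

    from pyspark import SparkContext

    sc = SparkContext(appName="lineage-demo")

    # Each transformation produces a new RDD and records its lineage.
    lines = sc.textFile("hdfs:///data/input.txt")        # RDD of lines from HDFS
    words = lines.flatMap(lambda line: line.split())     # flat-mapped RDD of words
    pairs = words.map(lambda word: (word, 1))            # mapped RDD of (word, 1)
    sorted_pairs = pairs.sortByKey()                     # sorted RDD, keyed by word

    # Nothing has executed yet; an action such as take() triggers the whole chain.
    # If a partition of `pairs` is lost, Spark replays the flatMap and map steps
    # on the surviving input to rebuild it, instead of relying on replication.
    print(sorted_pairs.take(10))
    sc.stop()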
The first one is map. This slide shows how you create a SparkContext; this was the style in older Spark versions, and I'll show newer examples in the live demo. The idea is simple: you create an RDD by using the parallelize function on the SparkContext. What it does is load those three strings into an RDD, and then it applies that map lambda function. It's a simple example: you compute the length of each string and emit it back. That collect function is an action; it brings the data out of the RDD back to your driver program.

The second one is flatMap. flatMap is like a map, except that each input element can produce zero or more output elements, and the results are flattened into a single RDD.

The next one is filter. Filter is simple: if you want to keep only some of the data from an RDD, you apply a filter function. Here I have a list of integers and I want only the even ones, so I use a lambda that does a modulo, and the filter keeps only the elements where that lambda is satisfied. As simple as that.

This next one is an example of an action on an RDD: reduce. Let's say you want to compute the sum of one million integers. You create the list, which is stored in num_list, then you perform the reduce operation on that RDD using operator.add, and it gives you back the total.

Similarly, you can count. Suppose you want to count how many words there are in a text file. You split the lines into words, flatten them, and then simply count the number of elements in that RDD.

And at the end you can save the data back to HDFS or to your local disk. You load whatever file is there in HDFS as an RDD, split it, flatten it, and use saveAsTextFile, mentioning the output directory where you want the text file saved.

So let me show you some demos; that will make things clearer. Is this visible? Should I increase the font size? OK, is this fine? Yeah. So this is how you start a demo: you always start with word count. What I do is, in line 4, I import SparkContext from pyspark, and I create a SparkContext with an app name, "word count demo"; you name your app. In the next line you supply a text file that you have on your local machine and load it as an RDD. Then you apply a filter to it; that filter removes all the blank lines from the text file. After that, for all the lines which are not empty, you flatten them by splitting each line into individual words. Then, in line 17, you perform a map, because I want to compute the top five words by frequency: you emit each word as a (word, 1) pair, you reduce by key to get the count for each word, and you swap the pair so that the frequency becomes the key and the word becomes the second element. You can see that I have done a sortByKey: since the frequency is now the key, it will sort by frequency, and that False means it will sort in descending order. If you don't mention anything it assumes True, which means ascending order; that's why you pass False. At the end you do a take on the word counts. take is also an action: it takes as many elements as you mention in the parentheses, here the top five, and returns them so you can show them on stdout.
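Pulling the pieces of that walkthrough together, a minimal standalone version of the word count demo might look like the sketch below; the input path is illustrative, and the exact script shown on screen may have differed slightly.

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="word-count-demo")

    lines = sc.textFile("sample.txt")                             # local file or HDFS path
    non_empty = lines.filter(lambda line: len(line.strip()) > 0)  # drop blank lines
    words = non_empty.flatMap(lambda line: line.split())          # split lines into words
    counts = words.map(lambda word: (word, 1)).reduceByKey(add)   # (word, count) pairs

    # Swap to (count, word), sort descending by count, keep the top five.
    top5 = counts.map(lambda wc: (wc[1], wc[0])).sortByKey(False).take(5)
    for freq, word in top5:
        print("%s: %d" % (word, freq))

    sc.stop()

It would be submitted to Spark with the spark-submit script, for example `spark-submit word_count.py` (the file name is hypothetical).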
So how do you run this? You run it with pyspark. The sample text file is one I've taken from Project Gutenberg. It shows a bunch of status messages, it starts an HTTP file server, it computes, and it shows you the output on stdout. OK, here is the output. It shows that the job finished and took two seconds; that's OK, right, because it's a local machine. And it shows the top five words along with their frequencies. Because I did not remove the stop words, it shows stop words, but you get the idea.

Also, I should mention: I said I ran it using pyspark, but it's not really pyspark. There is a script called spark-submit. When you are submitting a job to Spark you use this script; whether you are running a Python, Java or Scala application, you use spark-submit to submit the job to Spark. I have just created an alias so that I don't have to type it every time.

All right, next example. This is simple log processing: suppose you have a big bunch of HTTP logs. One thing I should mention is that the examples I'm showing might seem really simple, because since I'm running on a local machine I'm not testing with a lot of data. But the good thing is that whether I'm running this script on a 1 MB, 10 MB or 100 MB file, if you have a 1 TB or 10 TB file you can use the same script and run it on your cluster, and how fast it goes just depends on how powerful your cluster is. It doesn't change anything; the code remains the same. All the transformations and actions you apply on an RDD stay the same no matter how big your data is; it only depends on how big and how powerful your cluster is.

OK, so the second example is simple log processing. I'll just go through it quickly. Again you create an app, "log processing demo", then you get the log file as a command-line argument, and then you find out how many POST requests have been made: you filter the lines which contain POST, in caps. You also find out how many 200 responses there have been, how many valid responses, so you also filter out the lines which have 200 in them, and then you print the results to stdout. I'll just quickly run it. All right, the job is finished. Here it shows the number of POST requests. The interesting thing is, see this: here you are doing two operations on the same RDD. The first is finding the number of POST requests, and the second is finding the number of valid responses. So if you look here, it first gives you the number of POST requests that have been made, then it triggers another transformation and action, and somewhere here it gives you back the number of valid responses.
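A minimal sketch of a log-processing job along those lines is below; the substring filters are deliberately simplistic (a real script would parse the log format properly), and the file path comes in as a command-line argument as in the demo.

    import sys
    from pyspark import SparkContext

    sc = SparkContext(appName="log-processing-demo")

    # Path to an HTTP access log, passed on the command line via spark-submit.
    logs = sc.textFile(sys.argv[1])

    # Two separate actions on the same RDD, as in the demo:
    post_count = logs.filter(lambda line: "POST" in line).count()
    ok_count = logs.filter(lambda line: " 200 " in line).count()

    print("POST requests: %d" % post_count)
    print("200 responses: %d" % ok_count)

    sc.stop()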
I'll quickly go through the third example. This is an example of logistic regression, one of the popular machine learning techniques for classification; this example deals with binomial, two-class logistic regression. As I was telling you, you can use NumPy or any of the existing Python packages; we are using NumPy here because it's fast. We import SparkContext from pyspark as usual, and we also import something called LabeledPoint, because you have to label each of the points.

We are using a random dataset, and LabeledPoint is what labels each of those feature vectors. We also import something called LogisticRegressionWithSGD, which is available under pyspark.mllib.classification. Notice that LabeledPoint lives under the regression module and the model class lives under classification; there are similar packages for clustering and for recommendation. What you do first is parse the points. You have a text file, and it looks something like this; it's messy. So you first clean up the data as you parse it, extract the feature vector for each point, and return it as a LabeledPoint object. After that you start your SparkContext, you also take the number of iterations, and you create the model by training those feature vectors with the train method available on LogisticRegressionWithSGD, passing the number of iterations. By default the number of iterations is 100, but in the demo I'll give something smaller than that. Finally it gives you the hyperplane weights and the intercept value.

Someone asked where the NumPy code is in this example. So, it has a dependency on NumPy: MLlib keeps NumPy there to do faster processing. NumPy is just a dependency; you don't have to import it yourself. Would existing NumPy code in this example just work out of the box? Sorry, I didn't get that; I'll get back to you on this. OK, I'll need just one minute.

So the final example is quite an interesting one: how to do SQL queries with PySpark. You import the SparkContext, and now you also import something called SQLContext from pyspark.sql. You create the context, and there is a simple JSON file; it looks like this, it has names and ages. In line 16 you call printSchema, so it will figure out what the schema is and print it to stdout. Next, you register it as a table; I've given it the name people, but you could give it something else. Then you run this query, SELECT name FROM people WHERE age >= 13 AND age <= 19, just like a normal SQL query, and it gives you the names of the people who are teenagers. So let's run this. Yeah, almost done. If you can see here, it has printed the schema, since we used the printSchema function: there is age, which is an integer, inferred automatically, and name, which is a string and nullable, because we did not mention it explicitly. And at the end it has printed, to stdout, the names of the people who match the SQL criteria we gave.

It's kind of difficult to pack all of this into a half-hour talk session; there are more interesting things you can do with Spark, and especially with PySpark. So let me get back to the slides. If you want to contribute to PySpark, you can submit a pull request on GitHub. There have been a lot of contributions in the last six months, and we hope you also contribute back to the project. If you use it and find a bug, you can report it on JIRA, and you can also join the Spark mailing lists. These are my contact details. That's all. Yeah, thanks. Any questions?
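For reference, here is a minimal sketch of the logistic regression example, reconstructed from the description above against the Spark 1.x MLlib API; the input file, its format and the iteration count are illustrative, not the exact ones used in the demo.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    def parse_point(line):
        # Assumed format: label followed by space-separated features, e.g. "1 0.5 2.3 ..."
        values = [float(x) for x in line.split()]
        return LabeledPoint(values[0], values[1:])

    sc = SparkContext(appName="logistic-regression-demo")
    points = sc.textFile("lr_data.txt").map(parse_point)     # parse and label each point
    model = LogisticRegressionWithSGD.train(points, iterations=20)
    print("weights: %s, intercept: %s" % (model.weights, model.intercept))
    sc.stop()

And a sketch of the Spark SQL example, using the SQLContext API as it existed around Spark 1.1 (jsonFile and registerTempTable); the file name and its contents are assumptions.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="sql-demo")
    sqlContext = SQLContext(sc)

    # people.json is assumed to hold one JSON object per line, e.g. {"name": "Ravi", "age": 15}
    people = sqlContext.jsonFile("people.json")
    people.printSchema()                          # prints the inferred schema
    people.registerTempTable("people")

    teenagers = sqlContext.sql(
        "SELECT name FROM people WHERE age >= 13 AND age <= 19")
    for row in teenagers.collect():
        print(row.name)

    sc.stop()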
I'll put the slides on my GitHub repository, and I'll also add the link in the comments on the talk proposal I submitted. Yeah, sure, go ahead; I can hear you, and I can repeat the question for the audience if you want.

So the question is that most of the stuff I was talking about, like regression, depends heavily on iterative processing, and whether there could be something like an OpenCL backend for it. See, what happens is that it actually goes back to the Scala code underneath: PySpark uses a gateway, calls the Scala code that's written, passes all the parameters there and gets back the output. So I'm not really sure whether OpenCL is compatible with Scala and the whole JVM framework. The comment from the audience is that OpenCL is actually pretty portable, there's portable OpenCL too, and there's also CUDA for GPUs, which has a Java binding, JCuda; since this stuff deals with so much parallelism, why not have a CUDA backend so the parallel parts become much easier? That's possible. If you want to leverage the GPUs in your cluster, I don't know whether someone has started that work, but it would definitely be interesting; it's not there yet, because it's only about a year old as a project. I understand, I'm just curious; I'd like to see OpenCL and CUDA support in this. Yeah, definitely, if it's possible.

How does Spark integrate with NumPy? All these machine learning functions use NumPy as a backend, so it's one of the dependencies. Is it a dependency only for building the MLlib module? Yes; if you're compiling PySpark from source, you need to have NumPy on your machine. What about SciPy, or something else in the stack which also depends on NumPy, like pandas? I don't think MLlib depends on pandas; it only depends on NumPy. But pandas also depends on NumPy, so in a similar way you can also integrate pandas: you can mash them up in your code and use them together. So basically, any NumPy-dependent code in whatever job you submit will just work as expected? Yes, as long as you have NumPy installed, that's all. OK, thank you.

Sorry, how does that work? OK, so streaming is not really supported. If you're asking about streaming, it's not yet supported in PySpark, but we are bringing it in the next version that's being released; it's already there for Java and Scala. That's it. So guys, if you have any questions, I'll be around just...