Our next talk is on the topic "Explore Big Data Using Simple Python Code and Cloud Environment" by Hari Krishna Rava. He is working as a performance engineering lead at Accenture. Hi guys, good evening. My name is Hari Krishna, I'm from Accenture. Working as a performance engineering lead is my day job, but I write a lot of Python programs on my own to improve productivity in my project. As part of that I explored big data and the cloud, mainly using Python. So I just want to share my knowledge on exploring big data with very simple Python code, without needing to be an expert in it, and on making use of the cloud. The agenda I'm going to follow is simple. We will take an example of a big data set, explore the data and create a problem statement on it, look at why we can't handle it with a traditional process, and see how we can use Hadoop MapReduce instead and how you can write a Python program to process the data. Later we will talk about Amazon EMR, which is a further abstraction: the whole thing is offered as a service, so you just need to write your program and submit it, and you need not worry about the infrastructure at all. Then I'll quickly go through the demo and some interesting stuff at the end. Before that, this is the solution; instead of starting with the problem, I'm showing the solution first. These are the top ten web pages viewed on English Wikipedia during the month of May. Can you relate them to the month of May? Maybe a spaceflight happened around then in 2015, which is why there are a lot of hits on those pages, and Falcon 9, I think the v1.1 version had flights around that time or something like that. The second one is June. For June you can also try to relate: number nine is Game of Thrones, maybe because season five had started. For August, these are the top ten pages viewed on Wikipedia, and the top one, with almost 7 million hits, is nuclear magnetic resonance. You can also relate it to the presidential campaign that was going on, which may be why Donald Trump is at number nine. And this is September: maybe something happened with the PlayStation 3, and Google had just restructured into Alphabet, which may be why Google shows up. One thing you can try to correlate, and I don't know why, is that in June you have "Deaths in 2015" at number five, and in August you again have "Deaths in 2015"; I don't know why so many people were hitting that page. And in September you again see Hillary Clinton. So you can try to relate what is happening in the world to the pages being viewed on Wikipedia. So how did I get all this data? Everyone knows Wikipedia, it needs no introduction, but think of it as data created voluntarily by people across the world; there are almost 49 million articles across English and all the other languages. Imagine a world without Wikipedia or Stack Overflow and all these things: what would a search engine even look like?
Chances are that among the top ten results for a search in Google or any other search engine, you will find a Wikipedia page near the top. So how did I get the data? Wikipedia is a non-governmental, non-commercial, non-profit organization created by volunteers and funded by the public, so they have the freedom to provide everything as open data to everyone. A commercial company cannot disclose how many customers or users it has or how many hits per second it is getting, but Wikipedia can. What they do is publish, on Wikipedia.org, hourly log files of which pages were viewed: the page name, the number of requests in that one hour, and how much data was downloaded as part of it. The first column is the project, that is, the language: Dutch, English, es for Español, fr for French, and so on. So you have hourly log files. Now imagine an hourly log file of about 100 MB in compressed format; uncompressed it is almost 400 MB. If I want to process the data for one month, that is approximately 24 hours times 31 days times 400 MB, around 300 GB of data and approximately 5 billion records (a sample line and the arithmetic are sketched below). Can you process that on a normal computer or server? No; that is a different class of volume altogether. So what about traditional approaches? You could put it in a large, high-speed database, but that is expensive, and here we are only talking about 300 GB; think of terabytes or petabytes. It is expensive and it can't scale as the data increases. Similarly, file processing on high-CPU, enterprise-grade, world-class servers, something like a supercomputer, is also very expensive and not affordable for normal people. And if something fails when the job is 99% complete, you need to rerun the whole job again. In both cases the infrastructure is expensive and it takes more time to arrive at a solution. The solution we have is Hadoop MapReduce. I think everyone knows about it, but I'll give a quick overview. What you do is split the input file into blocks of 64 MB or higher and distribute them to multiple commodity computers. Commodity computers are not enterprise-class servers; they can be any normal machines, with say 4 GB of RAM and a normal Intel chip. You split the file into blocks, distribute them across multiple commodity computers on the same network, and process the blocks there. The nodes that hold the blocks are called data nodes, and you push the tasks, the processing, to execute on the data locally. Normally with a database you fetch the data and process it in your app server, but here the processing happens locally where the data sits. You are also distributing the data across multiple nodes, so when you are doing parallel processing across the nodes and one node fails, you would have to rerun the job again, and you lose the integrity of the job.
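Before getting to how Hadoop solves that, it helps to ground those numbers. This is a hedged sketch: the line below follows Wikimedia's public pagecounts layout (project code, page title, requests in that hour, bytes served), but the particular title and counts are made up for illustration, and the arithmetic simply reproduces the one-month estimate mentioned above.

```python
# One illustrative line from an hourly Wikipedia pagecounts file.
# Fields: project code, page title, requests in that hour, bytes served.
sample_line = "en Main_Page 242332 4737756101"
project, page, requests, size = sample_line.split()

# Back-of-the-envelope size of one month of uncompressed hourly files,
# using the roughly 400 MB per hour figure mentioned above.
total_gb = 24 * 31 * 400 / 1024.0
print("one month of data is about %.0f GB" % total_gb)   # close to 300 GB
```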
Hadoop's answer to the node failure problem is to replicate the blocks: instead of one copy of a block, you make two or three copies, one copy on one node and another copy on another node, so that if one node fails, another node still has the block and can process the data. That is how Hadoop MapReduce works. The first part, the distribution of the files, is called HDFS, the Hadoop Distributed File System. The next part is MapReduce, which is the logic you write to process the data in the blocks and get the output. This is how it works: you have the different splits, the blocks, and on each block a mapper runs and produces output. Let us say you have 64 GB of data; with 64 MB blocks that is almost a thousand splits, so almost a thousand map tasks. On each split a map executes and produces an output, and once you have the output from all the thousand mappers, it is merged, sorted by key and given to the reducer. The reducer aggregates the data and writes the final output. That is how parallel processing happens in Hadoop MapReduce, so what you need to write is a mapper and a reducer, as we call them. Hadoop MapReduce programs are normally written in Java, and most of them are, but due to the popularity of languages like Python and Ruby you can write the programs in those as well, using Hadoop Streaming, the API for non-Java programs. It is exposed so that you just write your program and make use of the API. The way the API works is that the blocks are presented to the mapper on standard input, and the mapper writes its output to standard output, in the usual Unix way. Now let's look at how the program is written for this problem statement. This is the format of the input file: space delimited, with four columns per line. I need to extract the data for English, and I need only the page name and the number of requests, columns two and three; I don't need columns one and four. That is what the mapper does: it just filters the data for English, and it's very simple, only six or seven lines of code. I also filter on whether the frequency is greater than 100. Why greater than 100? Because in the top ten you will never see counts of only a hundred; they are definitely in the thousands. So I filter for requests greater than 100 and project equal to English; that is the data you extract, and that is the mapper output. Then you have the shuffle. The shuffle is taken care of by Hadoop; you need not write anything for it. You just need to emit the mapper output as a tab-delimited key and value pair. What the shuffle does is gather all the mapper output and sort it by the key; here the key is the first column, the page name, and everything is sorted by that. That becomes the reducer's input. Because you are processing data from multiple hourly log files, the same key can appear in more than one file, which is why you see multiple instances of the same key in the reducer input. Then what the reducer needs to do is simply sum them up; a sketch of the mapper described here is shown below.
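A minimal sketch of the mapper just described, assuming the space-delimited pagecounts layout above. This is not the exact code shown in the talk, only the same filtering logic: keep English pages with more than 100 requests and emit a tab-delimited key-value pair for Hadoop Streaming.

```python
#!/usr/bin/env python
# mapper.py: read pagecounts lines from stdin, keep English pages with
# more than 100 requests, emit "page_name<TAB>requests" on stdout.
import sys

for line in sys.stdin:
    fields = line.split()                      # project, page, requests, bytes
    if len(fields) != 4:
        continue                               # skip malformed lines
    project, page, requests = fields[0], fields[1], fields[2]
    if project == 'en' and requests.isdigit() and int(requests) > 100:
        # Hadoop Streaming expects tab-delimited key/value pairs on stdout.
        print('%s\t%s' % (page, requests))
```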
In the reducer you read the input line by line, and until you encounter the next key you keep summing up the values; when the key changes, you emit the total for the previous key (a matching reducer sketch follows at the end of this part). That is the reducer. It is a bit more involved than the mapper, but it is still an easy one. The output is the aggregate for each key, the total sum per page. From there you can take the data, sort it by the values, and get your top pages. That is how the mapper and reducer are written. So the only thing you need to do is write your mapper and reducer. Hadoop MapReduce takes care of the entire rest of the logic, splitting the files, distributing them, handling task execution and ensuring the job runs to 100% completion, and you get the output. You need not worry about anything else. But you still need to create the infrastructure: there is a master node and there are core nodes, also called data nodes. The data nodes are where you put the data and process it, and the master node keeps track of where the files are distributed, how tasks are assigned to each data node, and whether all the tasks have completed. So you still need to install the software and configure your master node and core nodes, and that requires plenty of effort; even though Hadoop MapReduce provides all the nuts and bolts, you still need to configure it. For a beginner like me, whose main aim is to explore the data rather than worry about how much effort it takes to build the infrastructure, that is where the Amazon cloud comes in. I think everyone is aware of the Amazon cloud. They have Amazon EMR, Elastic MapReduce, which is a service where you just define how many master nodes and how many core nodes you require, and which EC2 instance type; EC2 is what they call their virtual servers. There are different instance types in Amazon, and I'll show you: each one has a different number of CPUs, amount of RAM and hard disk, and based on that they charge per hour. For example, a c3.2xlarge, with its CPUs, RAM and SSD storage, costs roughly 0.4 dollars per hour. You just give Amazon that configuration and they take care of setting up the cluster for you. Then the next thing I need to do is provide where my mapper program is, where my reducer program is, where my input folder is, and where I want the output. That's it. I don't even need to log into the cluster and do anything on it. It is as simple as a GUI where you just click through; it is self-explanatory, you give the mapper, the reducer and the locations, and that's it. That was the theory part; now let's do a quick demo.
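To round out the code side first, here is the reducer counterpart to the mapper sketch above; again a hedged sketch of the summation logic just described, not the talk's exact script. Because the shuffle delivers the input sorted by key, the reducer only needs to accumulate until the key changes.

```python
#!/usr/bin/env python
# reducer.py: input arrives sorted by key ("page_name<TAB>count"), so we
# keep a running total and emit it whenever the key changes.
import sys

current_page = None
total = 0

for line in sys.stdin:
    try:
        page, count = line.rstrip('\n').split('\t', 1)
        count = int(count)
    except ValueError:
        continue                               # skip malformed lines
    if page == current_page:
        total += count                         # same key: keep accumulating
    else:
        if current_page is not None:
            print('%s\t%d' % (current_page, total))
        current_page, total = page, count

if current_page is not None:                   # flush the last key
    print('%s\t%d' % (current_page, total))
```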
Before going into the demo, one more thing: in Amazon you have something called S3, which is where you store your data. It is like an extension of your hard disk; it can act as a disk for the virtual servers you create in Amazon. They call the containers buckets, and you create your own bucket, whose name is unique across all users worldwide. So I created a bucket called perf test map, created an input folder, and downloaded all the data from Wikipedia. I created one EC2 instance, a virtual server, and downloaded the files onto it. That is free of cost. Think of downloading 300 GB of data onto your local machine and how much bandwidth that would cost you; here, downloading anything from the internet into Amazon is completely free, so I haven't paid even a single rupee for downloading the data from Wikipedia. I downloaded all the files, uncompressed them, and pushed them into S3. This is my input folder, and I created another folder called output. All right, this is fine, thank you. So I created an input folder, an output folder where I want the output, and a folder for the scripts; the scripts I showed you earlier are just placed there as the mapper and the reducer. That's it. Now I'll go to Elastic MapReduce. I click Create cluster, give it a name and so on, and then I need to give my configuration. You can see here that there are different instance types: compute optimized, memory optimized and storage optimized. It will take a few seconds to load the pricing data. You can see that t2.micro is the very smallest one, with 1 GB of memory, costing approximately one rupee per hour, while at the top end something like a c3.8xlarge, with 32 CPUs, 60 GB of RAM and around 640 GB of SSD, costs this much. I just need to give how many core instances I need, and then, as I mentioned earlier, select my step as a streaming program, configure and add it, and give the locations of my mapper script, my reducer script and my input. Let's say I want the input for September, so I select it, and then I give the output location. This S3 data stays around even if your cluster is terminated; it remains available to you, as if it were your Google Drive or something. So I just create another output folder, September output, and click Add. Once I do that, it takes some time, almost eight minutes, to provision the hardware and configure it, so I haven't done that live. I already created the cluster for this demo, with six c3.2xlarge core nodes and one master node, and I just add the steps here. Since morning I have been running different jobs: all the September files, the top ten French pages, the top ten Dutch pages, and so on. Currently the all-July Dutch job is running, so I can show the jobs and the tasks. The view is a bit misleading, but what you can see is that there are almost 1,173 pending tasks. It shows all the mappers and reducers executing here, so it shows most of the information for you.
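This part of the demo goes through the EMR console, but the same streaming step can also be submitted programmatically. Below is a hedged sketch using boto3; the region, cluster id, bucket name and S3 paths are placeholders, and on older EMR releases the step is specified through the hadoop-streaming jar path rather than command-runner.jar.

```python
# Add a Hadoop Streaming step to an already-running EMR cluster (illustrative sketch).
import boto3

emr = boto3.client('emr', region_name='us-east-1')           # placeholder region
response = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',                              # placeholder cluster id
    Steps=[{
        'Name': 'wikipedia-top-pages-september',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',                      # dispatches hadoop-streaming on newer EMR releases
            'Args': [
                'hadoop-streaming',
                '-files', 's3://my-bucket/scripts/mapper.py,s3://my-bucket/scripts/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', 's3://my-bucket/input/september/',
                '-output', 's3://my-bucket/output/september/',
            ],
        },
    }],
)
print(response['StepIds'])                                    # ids of the submitted steps
```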
Looking at the jobs view, this is the one that completed previously, for all-July French, and I can see its output. If you look at this log, it shows a quick progress log, 78%, 79% and so on, and gives you quick metrics: how many bytes were read, how many bytes were written, and how many input records were processed as part of the job, almost 5 billion records, along with the output records we got. All the metrics are provided there. Hadoop MapReduce also provides a web interface. You just need to set up SSH tunneling to the master node, which I already did, and this is the Hadoop web application for tracking jobs. There you can see the current job and its status: how many tasks in total, how many are pending, how many are running. You can look at the completed ones (it takes some time to load) and see exactly what each task did and how much time it took; you can see that each mapper task here executed for approximately 10 seconds. You can also see counters; for example, there is a default definition of how much memory is allocated to each mapper task. All of these are default values, and you can change them, and you don't even need to log into the cluster and configure it for that; you can change them from the UI itself. These are the different counters, the same ones I showed you in the syslog, and these are the different configuration parameters. You can see that there are almost 716 parameters in Hadoop that you can configure, such as the block size. One more important thing: the block size in Hadoop is normally 64 MB or more. The reason is that the time to seek on the hard disk should be small compared with the time it takes to read the whole block; you should not have too much overhead from seek time. That is why the minimum is 64 MB, and the block size can be increased further. You can configure all of this. Initially I did some trial and error: I tried different types and numbers of nodes and watched whether the job was memory intensive or CPU intensive. You just monitor, do trial and error, and come up with an optimized number of nodes, the instance type you need, how much memory you need, and how much memory per mapper task. All of that needs to be configured. So that is the demo, and that is how I got the output. You can see that I have not done any configuration or anything; I just wrote a mapper and a reducer, used my local Python to test them, and put them there. I did not have to log into any machine and configure anything else. That is the beauty of this. So that's it for the demo.
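One detail worth expanding: the local testing mentioned there is usually done by piping a small sample through the scripts and mimicking Hadoop's shuffle with a plain sort, in the spirit of "cat sample.txt | python mapper.py | sort | python reducer.py". A minimal harness, assuming the mapper.py and reducer.py sketches above and a small placeholder sample file:

```python
# local_test.py: sanity-check the streaming scripts on a small sample locally.
import subprocess

with open('sample_pagecounts.txt') as sample:         # placeholder: a small extract of one hourly file
    mapped = subprocess.run(['python', 'mapper.py'], stdin=sample,
                            capture_output=True, text=True).stdout

# Hadoop's shuffle phase is approximated locally by sorting the mapper output by key.
shuffled = '\n'.join(sorted(mapped.splitlines())) + '\n'

reduced = subprocess.run(['python', 'reducer.py'], input=shuffled,
                         capture_output=True, text=True).stdout
print(reduced)                                         # aggregated "page<TAB>total" lines
```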
Now, this is something I learned in one of the online courses: there is something called Simpson's Paradox. What it says is that, depending on how deep you look into the data, the insights you draw from it will change. Going forward, in our day-to-day jobs we will be dealing with more and more data, huge amounts of it, and how you interpret the data is very important. So this is Simpson's Paradox. In the example here, which I think everyone can see, there is a university with two major courses, major A and major B, and people of different genders applied. If you look at major A and major B separately, you can see that the acceptance rate is higher for one gender than for the other. But if you combine both of them, if you look at major A and major B together and project the acceptance rate, the result reverses: now the other gender has the higher acceptance rate. This is called Simpson's Paradox, and it is very important to keep in mind when you are looking into data in your day-to-day job.
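A small worked example makes the reversal concrete. The numbers below are made up purely for illustration, not taken from the slide shown in the talk: within each major, women have the higher acceptance rate, yet the combined rate is higher for men, because women applied mostly to the harder major.

```python
# Hypothetical admissions data illustrating Simpson's Paradox.
admissions = {
    # major: {group: (accepted, applied)}
    'Major A': {'men': (80, 100),  'women': (180, 200)},   # 80% vs 90%
    'Major B': {'men': (20, 100),  'women': (100, 400)},   # 20% vs 25%
}

totals = {'men': [0, 0], 'women': [0, 0]}
for major in sorted(admissions):
    for group, (accepted, applied) in sorted(admissions[major].items()):
        print('%s  %-5s  %5.1f%%' % (major, group, 100.0 * accepted / applied))
        totals[group][0] += accepted
        totals[group][1] += applied

for group, (accepted, applied) in sorted(totals.items()):
    print('Overall  %-5s  %5.1f%%' % (group, 100.0 * accepted / applied))
# Per major, women lead in both; combined, men 50.0% vs women 46.7%.
```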
Coming back to Wikipedia: the top pages analysis is just one piece of information I extracted, but there are any number of insights you can get from Wikipedia, because it is fairly accurate information about what users on the internet are currently doing, and you can build a lot of applications on top of it using it as the input. Other than that, there is one more community-driven data set. Wikipedia is a community-driven website, and similarly there is another community-driven big data set; I don't know whether everyone knows about it. It is indiarailinfo.com, also a community-driven data set. Okay, I think you can see it; I'm trying to search. It is community-driven, and we don't even know who founded the website. You just give the source and destination, say I want to go from Bangalore to Chennai, and it gives you the different trains with start, arrival and departure times, along with the average delay of arrival and departure. And the community users, the volunteers, create the other routes. For example, there is a train that goes directly from Bangalore to Chennai, which is what you see in the top two records (it is actually the Bangalore to Pondicherry train), but the other routes are created by the community: okay, you can go from Bangalore to Madurai, and from Madurai onward to another one, with a delay of this much. All of these are created by the community, okay? It is a very big, huge data set, and if people want to explore it, it has very accurate information and you can get a lot of insights from it. I'll just give you one example here, the difference between that community-driven website and a commercial travel aggregator. Think of going from Chennai to Bangalore: does it make sense to go via Mumbai with a wait time of 18 hours? That is what your commercial travel aggregator will provide, which makes no sense, right? When you try to search for Chennai to Bangalore (sorry, Bangalore to Chennai), it doesn't make sense to travel to Mumbai, wait there overnight and come back on another flight; it is almost 18 hours plus almost three hours, so you are taking 20 hours to travel between two places that hardly take six hours even by train, right? That is a normal commercial travel aggregator, which mainly depends on data mining to provide the information. Whereas on the community website, indiarailinfo.com, you can see that the maximum wait time is less than four hours when I'm trying to get from Bangalore to, sorry, Pondicherry. And if you look at the last column, it shows the user, the community member who created that row of data. Yeah, okay, sure, sure, sure. So that's it, okay, I'm done, sorry. I'm done with my presentation. These are the tools I used, and there is a GitHub link; the link is already available in the PyCon proposal, and it has all the steps, right from creating the account on Amazon AWS to installing Python, IPython and all the tools, so you can also explore big data using that. That's it from my side. If anyone has any questions, please let me know. Any questions, yeah. You said the English data, that's 300 GB of data, right? Sorry? For English data from Wikipedia. Exactly. How long did it take to process the entire 300 GB? Even now it is running. It took, well, it depends on how many nodes you give it. Due to some budget constraints and other things I gave only six nodes, because it is scalable. With only six nodes of this type it took hardly 40 minutes, 46 minutes, something like that, and Hadoop MapReduce is linearly scalable: if I give 12 nodes instead of six, it can process it in 20 minutes. At the same cost, right? No. More machines, less time. Yeah, and actually Amazon asks you to pay per hour, whereas I heard the other cloud providers are offering per-second billing as well. Next question. Anyone? Questions? Hi, third row here. Yeah. My question is, how well does Python work with Hadoop compared to Java? Yeah, actually I did some research on this. I don't know Java and all, but for normal text processing it is comparatively okay; performance-wise it is on the lower side. But for me that doesn't matter so much, and you can still do a lot of tuning on that part; I still need to work on that. Thank you for the session, and I have one question here. Yeah. Somewhere in my organization I have seen people using live Twitter feeds and creating beautiful websites or apps using AngularJS or something. Does that also use MapReduce to process the live Twitter feeds, or how is it done? I don't know much about that, but what I feel is that they use things like Spark and Storm, and the fundamental concepts are still like Hadoop MapReduce; that's what I feel, but I'm not the expert to comment on that. Thank you. And I'm still a learner. Any questions? Questions, guys? Okay. So that's it. All right. Thanks very much. And you have udacity.com, where you can learn most of the data science courses and related things. Most of the tutorials on that website are based on Python; whether it is analytics, artificial intelligence or robotics, they mostly give the demos in Python. Okay, a small announcement: at 4:45 we are having a feedback session, so guys, please stay and give all your feedback on how it was. Until then we are having a lightning talk. If anyone wants to give one, please come ahead. If anyone wants to give one, please come ahead. Come on, guys.
For this lightning talk we won't be having the projector set up; you just need to come here and talk, that's it, for three minutes. I'm here to talk. Good evening. Good evening. So exhausted. Energy, guys, energy. Good evening. I haven't had my chai; did you get chai? Okay. So I'm here to just share an experience. I've got the points over here, but you don't get slides, okay. So, why did you guys come here? The talk is about how to get the most out of PyCon and how to have a good time over here. Most of you came here expecting to attend very great talks and to learn, I don't know, everything there is to learn in just two days of short sessions, probably get expertise in Hadoop and maybe learn everything there is to know about metaprogramming. To be very fair, that's quite impossible. You'll get a basic idea, but not the whole deal; that requires a lot of effort. So my strategy for attending PyCon, for getting to know people and having a good time here, is to follow a couple of mantras. What I try to do is interact with people as much as I can, make friends, just bump into random strangers and say: hello, hey, who are you, what do you do? You might just make very good friends. You might get to know people who are doing great work in their own offices and are probably changing the world, and that can be a very good experience. Other than that, knowing people and getting known yourself can help you not only solve the technical problems you face at your job, but also in getting job opportunities. People are looking for good talent all the time, not just the sponsors; almost everyone working in any job is looking for good talent to get into their team. I know I am, and that's a universal problem: getting good talent. So if people know you, and know that you're great, you'll land multiple job offers; you'll be drowning in job offers if you're well known. And apart from the hiring thing, having a good time and being part of the initiative is a golden opportunity you get out of PyCon, because it's a voluntary initiative. You have an opportunity to be part of the organizing team, and you can help out in whichever way you can. You think you can help out with the network? Then before cursing out loud about it, come along; we will welcome you if you can help us. You're a sysadmin, you know how to set up routers, you know how to lay out cables? Come along, help us. You think the talk selection was bad, that pathetic, boring talks were selected this time? Be part of the talk selection committee next time. It's people like you who are doing it, so why not you? Just be part of it, in whichever way you can. So, hopefully you had a good time this year, and you'll involve yourself more next year; if this year was good, I'm pretty sure we'll have a better PyCon next year as well. Thank you. Okay, the next talk. A quick, three-minute talk, please. Hello, my name is Wamsi. Yes, my name is Wamsi. There are many talks on natural language processing, but I wanted to keep it more simple; we are working on all the basic stuff. So I would like to share my experience with natural language processing.
A few days back we were working on a project called News Aggregator. We were taking XML feeds from various sites like The Hindu, Deccan Chronicle, Times of India and so on. After taking the XML feeds, we filtered out all the titles from them using Beautiful Soup 4; Beautiful Soup is an extraordinary library for filtering out all those things. After that, once we had all the titles, we wanted to stack similar items together at one spot. So we were thinking about the various libraries that are available, and at that stage I went with NLTK. Okay, there is a problem with the projector. So I took NLTK, and with NLTK we did segmentation, tokenizing and POS tagging of the titles, only the titles. The projector is not working, so I can't show you that. The second stage was that we used a Naive Bayes classifier for stacking up similar items. We also used a couple of other libraries, like TextBlob, inside our engine. But the one thing I wanted to show you is that in production, when we wanted to release it, NLTK is too slow: it takes almost 12 seconds just to import the library. For that reason I would also recommend libraries like Pattern, TextBlob and Gensim; Gensim is used for similar-sentence recognition. So that's it. Those who aren't comfortable with NLTK can also use other libraries like TextBlob, and TextBlob stands on the shoulders of both NLTK and Pattern, so go try TextBlob as well. That's it. Thank you. Thank you.
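A rough sketch of the pipeline described in this lightning talk: fetch an RSS/XML feed, pull the titles out with Beautiful Soup, then tokenize and POS-tag them with NLTK. The feed URL is a placeholder, not one of the speaker's actual sources, and the one-time NLTK data downloads are noted in the comments.

```python
# news_titles.py: extract feed titles and tag them (illustrative sketch).
import requests
from bs4 import BeautifulSoup
import nltk   # importing and loading NLTK models is the slow part mentioned above

# One-time data downloads NLTK needs for tokenizing and tagging:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

feed_xml = requests.get('https://example.com/news/rss.xml').text   # placeholder feed URL
soup = BeautifulSoup(feed_xml, 'xml')                               # the 'xml' parser needs lxml installed
titles = [item.title.get_text() for item in soup.find_all('item') if item.title]

for title in titles:
    tokens = nltk.word_tokenize(title)    # segmentation / tokenizing
    print(nltk.pos_tag(tokens))           # part-of-speech tags for the title
```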