Welcome back to the next session. I'd like to introduce Kripesh Desai, who's going to be talking to us about big data. Kripesh is a data science enthusiast (did I get that right?) of sorts. Please welcome him.

Hello, everyone. I'm Kripesh, and I work for the Nelson Marlborough District Health Board as a data migration and business intelligence lead. Today I'm going to talk about big data and Python.

I see myself as a data engineer, and I was introduced to big data in 2013, when I was asked to replicate data from Oracle to Hadoop. The scope of the project was limited and we sort of did it, but then I wanted to learn more; I was really excited about the scope of big data and what it can do. So I packed my bags, left Auckland, and came to Nelson, because I could save some commuting time and give that time to studying big data. That's about two hours a day, which is good enough. And Nelson is good. It's been two or three years now that I've been learning and practising big data. I don't claim to be an expert, but what I want to do today is summarise my understanding, so that if I'm wrong, at least someone can guide me properly.

Okay, so let's start. Obviously I'm going to start with big data, where I'll try to see what exactly big data is, and I'll end with data science, because big data and data science are like two sides of a coin. Along the way we'll look at MapReduce, Hadoop, and cloud computing, and obviously we're going to look at Python, especially as a preferred choice of language for building big data solutions.

So, what is big data? I believe it's a very confusing term, because the first question it raises is: how big is big data? And there is no precise answer to that. In 2014 I found a really good answer to what big data is, and I believe it's still relevant: big data is like teenage sex. Everyone talks about it, nobody really knows how to do it, but everyone thinks that everyone else is doing it, so they claim that they are doing it too. That quote is from the gentleman on the slide, not me, so please don't take any offence at me.

So let's look at what big data is. To understand big data, we first have to understand data growth. Up until 2003 we had produced five exabytes of data in total; now we produce five exabytes every two days. Where is this data coming from? Here are a few examples. First, YouTube: the combined length of the video uploaded every minute is about 72 hours. Twitter: about 8 terabytes of tweets every day. Facebook: what we're feeling, what we like, what photos we post, everything is on Facebook, and that's around 10 petabytes per month and increasing. Internet archives: lots of new web pages, blogs, and websites are being created, and archiving them generates around 20 petabytes per month. We also have digital footprints: when we watch a movie online or read a book in an app, it generates a digital footprint of what kind of movie we watch, what kind of book we read, how many pages we read in one go. And we have sensors: along with RFID sensors there are web sensors, which produce click-stream data, for example which ad you watch, which ad you skip, and whether you actually buy a product online
or just go to the payment page and close the browser? People do that, right? I do that. So that's your click stream, generated by web sensors. In a way, we are producing this humongous amount of data every day, and the thought (the philosophy, the idea, whatever we want to call it) of processing this huge amount of data and creating some value out of it is what gave birth to the term big data.

And who else could it have been: it was Google who pioneered the big data solution, because being Google, you have to create an index of the entire web, update that index on a regular basis, and then run search on top of it. That's a big task, and the solution they created was the Google File System, and they published a paper on it. Yahoo picked it up, improved it to fulfil their own requirements, and made it an open-source project, releasing two products called MapReduce and Hadoop. Then the whole buzz around big data started, around 2005-06, and all the big enterprises wanted to join in as well.

Big data is not just structured data; that's the first thing. Structured data is our normal transactional data: rows and columns, tables and fields, relational, old OLTP data. Transactional or structured data is part of big data, but big data is not just structured data. It is also semi-structured and unstructured data. Semi-structured data is, for example, JSON or XML: it has some sort of structure. Unstructured data would be text files, web logs, or images. Big data is all about bringing these different kinds of data sources, different varieties of data, together; doing a lot of data crunching and data massaging; and producing some value out of it. That's what big data is, and we already use big data applications: Google Analytics, where Google manages to analyse entire web traffic on a regular basis; Twitter giving us the trending topics for each country; Amazon recommending a particular book or product when you're buying a similar one. These are all big data solutions.

Now, before we go further into understanding what big data is, we should first look at what is not big data. We all know food packaging. The technology of food packaging was derived from rocket science (aerospace science, whatever you want to call it), because astronauts have to stay in space for a long time, and so technology was developed to store food in a way that keeps it as it is for a long time. So we can say that food packaging is a by-product of rocket science, but we can't say that doing food packaging is the same as doing rocket science, right? Similarly, with the rise of big data, several technologies have evolved for processing and storing data, for example in-memory data storage and processing, or fault-tolerant Hadoop clusters that store and process data in a distributed manner. Now, if you are running your OLTP, transactional database and just writing SQL on top of these new technologies, that is not big data. If you are running your transactional database in memory, or building a data warehouse on a Hadoop cluster, that's not big data, because at the end of the day you are just doing SQL, just dealing with structured data.
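Since the structured, semi-structured, and unstructured distinction keeps coming up, here is a tiny illustration with invented records; only Python's standard-library json module is involved:

    import json

    # Structured: a row from a relational table, with fixed, known columns.
    row = (1001, "2015-03-02", 49.95)          # (order_id, date, amount)

    # Semi-structured: JSON. There is some structure, but fields can
    # vary between records and nest arbitrarily.
    record = json.loads('{"user": 7, "likes": ["python"], "geo": {"city": "Nelson"}}')
    print(record["geo"]["city"])               # -> Nelson

    # Unstructured: free text from a web log. No schema at all; useful
    # only after parsing or text mining.
    log_line = '203.0.113.9 - - [02/Mar/2015] "GET /talks HTTP/1.1" 200'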
So now, how exactly can an enterprise do big data? What is the right way of doing it? The right way is asking the right question: what do you want to know? Once you have that question, you can figure out whether you need a big data solution or not, and for that you can use the analytics continuum.

If you can find the answer at the level of descriptive analytics, which is about what happened in the past, that's our traditional data warehousing and business intelligence, telling you what the total sales were last quarter or what the profit was in 2015 in a particular region. If you can find your answer there, you don't need big data. The next level is diagnostic analytics, which is about why things are happening; it is useful for root-cause analysis and fraud detection. If the answer to your question leans towards prediction, then you need predictive analytics, which is about what's going to happen next. And the last level is prescriptive analytics, where you want the system to generate recommendations. With each level of this continuum, the volume, variety, and velocity of the data increase. So we can say that big data is about processing data with the three-V properties (volume, variety, and velocity), but for different big data solutions these three Vs may vary. There is no standard prototype or framework; every problem needs a different solution.

At the centre of big data processing is MapReduce. MapReduce is a programming model, and it has nothing to do with the map and reduce functions of Python; they're completely different. MapReduce is a programming model that processes large data sets and returns a collection of key-value combinations. You can write a MapReduce program and run it on your local machine, fine, but the real power of MapReduce shows when you run it in a parallel, distributed environment. For example, say you have a big text file and you want to know how many times each word occurs. You can write a MapReduce program to do that, run it in a distributed environment, and get your result quickly; I'll show a sketch of exactly that program in a minute. And because a distributed environment is involved, big data processing is about horizontal scaling, not vertical scaling. Horizontal scaling is when, as the demand for data processing and storage grows, you add more machines to an existing cluster. Vertical scaling, which is our traditional database server, is when, as demand grows, you add more memory, more processing power, more storage. Big data is about horizontal, not vertical.

Now, I don't want to go too deep into the technical side of MapReduce, but let's try to understand it with a very simple example. Let's say there is some parallel universe where the addition of ten-digit numbers is a big task, a big data task, and being an enterprise, you want to add ten-digit numbers within a time limit of one hour. There's one PhD guy who says: okay, I can do that, I'll charge you $100, and I can do it in one hour. Fine; that's one way of doing it. What is the MapReduce way of doing it?
Instead of hiring that PhD guy, we hire one graduate of mathematics (a graduate of addition, if you like) who can only add five-digit numbers; he'll charge you $30 and can do his part in 20 minutes. Along with him we hire five high-school students who can only add two-digit numbers; they'll charge you $5 each and can do their part in 10 minutes. With this team you can add the ten-digit numbers in just 20 plus 10, that is 30 minutes, for 30 plus 25, that is 55 dollars. That's the power of MapReduce: almost half the time, half the price. The condition is that the five students must work in parallel.

That's good, but one critical thing is missing from this example. There must be someone responsible just for dividing those ten digits into subsets of two digits, handing each two-digit piece to one of the five students, and, when they are done, accumulating the results and handing them over to the master, the graduate guy. Who's going to do that? That's where we have Hadoop. Hadoop is a framework, a cluster of computers, that provides the parallel, distributed environment on which we can run MapReduce programs. A Hadoop cluster can have one or more master nodes and multiple slave nodes; in our example the blue guy, the graduate, is our master node, and the high-school students are our slave nodes. The backbone of Hadoop is HDFS, the Hadoop Distributed File System. HDFS is predominantly responsible for dividing your large data set into small subsets, distributing them among the slave nodes, copying your MapReduce program over, doing the processing, accumulating the results, handing them to the master node, and so on. HDFS is the core.

Now, if you're an enterprise that wants to do big data, you have to invest in Hadoop. A Hadoop cluster can be three computers, thirty computers, or three thousand computers; it completely depends on your requirements. You also need human resources: Hadoop administrators, Hadoop engineers. That's a lot of cost. Is there any way to do big data more cost-efficiently? Yes, of course: that's where we have the cloud. Cloud computing is all about using someone else's computer; in a very simple way, I can say that. There are providers like Google, Amazon, and several others who offer Hadoop as a service, so you don't have to invest in Hadoop in-house. You can just focus on your MapReduce program and, once you're done, run it on a Hadoop cluster somewhere in the cloud.

And believe me, it's very easy to run a MapReduce program in the cloud. I have an example here. For this example I'm using Amazon, whose Hadoop service is called Elastic MapReduce (EMR). You can see it's very easy. Obviously I have to create an account on Amazon, and then I have to set environment variables for authentication. Once that's done, with the first command I run a MapReduce program written in Python on my single machine, playing with a very small data set of a hundred thousand records. Once it's running fine, I run it again, but this time on a Hadoop cluster on Amazon, and that's it.
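The demo script itself isn't reproduced here, but with the mrjob package (more on it later) the word-count job I described earlier looks roughly like this; a minimal sketch, with made-up input paths and bucket names:

    # mr_wordcount.py -- a minimal mrjob word count (sketch)
    import re
    from mrjob.job import MRJob

    WORD_RE = re.compile(r"[\w']+")

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Emit (word, 1) for every word on the line.
            for word in WORD_RE.findall(line):
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the ones for each word across all mappers.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

    # Run it locally first, on a small sample:
    #   python mr_wordcount.py sample.txt
    #
    # Then run the very same script on EMR; the job code does not change,
    # only the invocation (AWS credentials are assumed to be in environment
    # variables, and the S3 paths are placeholders):
    #   python mr_wordcount.py -r emr s3://my-bucket/input/ \
    #       --output-dir s3://my-bucket/output/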
As you can see, I don't have to change much: just two new parameters, and that's it. Once those runs are fine, I go for the real big data problem. I run it with 1 million records, this time on a Hadoop cluster of 20 machines, with just that one command. I don't have to do anything else. The command will go and talk to the cloud, create the cluster, configure the cluster, copy the data, do everything, and finally give me the output in that file. That's it. I don't even need to know how to administer Hadoop when I'm running this; it's that simple. I know you don't believe me, so I have a demo.

You can see here (let me get the demo code up) that I'm running that command against a Hadoop cluster of 20 machines. I'll have to skip ahead quickly: it talks to the cloud, creates the cluster, does everything for me, shows me the log information, and finally I get the output in the file. This is the dashboard of EMR, Elastic MapReduce. It took 42 minutes to run this, with the default configuration, and as you can see, out of 20 machines one was my master node and the remaining 19 were my slave nodes, which Amazon calls core nodes. This is all the default configuration; I'm not tuning anything, and therefore these machines are low-end machines, not superpowered ones. The default instance type is a medium machine, m1.medium. But once you get used to this, you can log in and configure the Hadoop cluster beforehand. You could have a cluster of, say, 10 machines, each with 256 GB of memory or more and a lot of CPU power, and do this in maybe four minutes instead of 42. That's the power of cloud computing.

Now let's finally talk about Python. Why Python? Why did I write that program in Python? The reason is, first of all, that Python is awesome. It's easy to learn and very intuitive; we all know that. But the real thing is that it has really powerful features for working with large data sets, and the backbone of Python is obviously the vibrant open-source community and its reusable packages. Remember, in big data the data flows from different sources: transactional data, JSON and XML, text, images; sometimes you have to scrape different websites to get data for analysis. Python already has packages for all of that (I'll show a tiny sketch in a moment), so as a programmer you don't have to start from scratch, and that saves a lot of time.

The other thing is that in the beginning, apart from Java, the programming language R was very popular in the big data and data science community, but over the last five or six years Python has emerged as a preferred choice of language for big data and data science. In fact, when I started doing big data I did an R certification, and now I hardly use R, because practically there is nothing you can do in R that you can't do in Python. There might be a few things that are nicer in R, but there is no task I can see that you can do only in R and not in Python. And the big advantage is ease of deployment. Python is used for a lot of things: web development, desktop applications, there are even ERPs in Python. So when it comes to deploying a big data solution into an existing workflow or a production environment, Python has the upper hand over R. And obviously there's the recent release from Google, the machine intelligence library called TensorFlow, and that's in Python. I believe that's living proof of Python's increasing dominance in the big data and data science community.
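To make that point about packages concrete, here's a minimal sketch; the package choices (requests and beautifulsoup4 for scraping, plus the standard-library json and xml modules) are illustrative, not necessarily the ones from the talk:

    import json
    import xml.etree.ElementTree as ET

    # JSON and XML parsing ship with the standard library.
    profile = json.loads('{"name": "user1", "city": "Nelson"}')
    movie = ET.fromstring("<movie id='50'><title>Star Wars</title></movie>")
    print(profile["city"], movie.find("title").text)

    # Scraping a web page: requests + beautifulsoup4, both pip-installable.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text   # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.string)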
I have an example here. Let's say we have three users. Users one and two have both watched and rated Star Wars part one and part two; for some weird reason, user three has only watched and rated part two. Looking at this picture, we can easily recommend that user three watch part one, right? But how can you automate this recommendation when you have millions of ratings for thousands of movies by millions of users? You can do so by writing a MapReduce program in Python.

This is our sample data. Let's say we have a lot of rows like this: user ID, movie ID, rating, and a timestamp. Obviously we also have a lookup table for movie names, which I've not included here. You can write a MapReduce program to do this, and there's already a Python package for it called mrjob, for "MapReduce job". I don't have time to go through each line here, but what I want to highlight is that, as you can see, I first import the package, then use the features of mrjob to solve the problem, and at the end I configure a pipeline of MapReduce jobs, because one MapReduce job is not enough to handle this situation. I create a pipeline where the output of the first MapReduce job becomes the input of the second, and so on.

And this is the output. You can see in the highlighted line that Big Night and Rushmore have both been rated by 241 users, and the similarity of their ratings is 0.9654, almost 0.97 out of 1. So we can say that if somebody has watched Big Night, we can recommend Rushmore, and vice versa. But how do we get that 0.97? That's where data science comes into the picture. You can figure out the similarity measure between the ratings of two movies using the cosine similarity algorithm. Cosine similarity tells you the similarity between two vectors, so you first have to create vectors of movie ratings for every pair of movies that have been watched by the same users, out of those millions of ratings; only then can you apply the algorithm. Creating those rating vectors is the MapReduce job, and once you have the data ready for data science, then and only then can you apply data science. This is what big data and data science are all about. And that's just one way of doing it, by the way, because cosine similarity actually falls under the big umbrella of machine learning, and there are several other machine learning algorithms as well.
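The full recommender isn't reproduced here, but the cosine-similarity step on its own is tiny. Here's a minimal sketch using numpy and two invented rating vectors; in the real pipeline, the MapReduce jobs are what build these vectors out of the millions of raw ratings:

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) = (a . b) / (|a| * |b|); 1.0 means the two rating
        # vectors point in exactly the same direction.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Invented data: ratings of two movies by the same five users.
    big_night = np.array([5, 4, 5, 3, 4])
    rushmore  = np.array([5, 5, 4, 3, 5])

    print(cosine_similarity(big_night, rushmore))   # ~0.985 for this toy data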
So let's look at machine learning. Machine learning is all about recognising patterns in data and doing predictive analytics with them, and we already use machine learning applications every day. Have you ever wondered how a traffic camera can read your number plate? Or how an email application can now classify your email as marketing or promotional, social media, or normal inbox mail? Or how, whenever you upload a photo to Facebook, it highlights human faces and asks you to tag them? It will never highlight a dog or a cat; it always highlights humans. The machine is able to understand that there are human faces in the photo. These are all machine learning applications.

There are several machine learning algorithms, but you can classify them into three types. The first is supervised. In supervised machine learning you have a lot of data, obviously, and you can see the data as features and an outcome. For example, let's say you have all the house prices of the whole of New Zealand. Your features would be the number of bedrooms, the number of bathrooms, the total area of the house, whether there's a garage or not, a garden or not, the city and suburb, and so on, and the outcome is the price of the house. Once you have all this data, you can run a supervised machine learning algorithm on it, which gives you a mathematical model. With that model, when you have a new house, you can enter its features and it will predict the price. That's supervised.

Unsupervised is when you don't see the data as features and outcome. You just have a lot of data; you run an unsupervised algorithm and it creates clusters. It divides the data into several clusters, so closely related data is put into the same cluster. It is very useful for finding outliers, root-cause analysis, and fraud detection. One example could be catching a fraudulent credit card transaction on the fly: not after it has happened and you get a call, but catching it as it happens. That's unsupervised.

Now, reinforcement. I believe it's the very extreme end of data science. The algorithms in reinforcement learning are inspired by how a human brain reacts to punishment and reward, and they are used in systems that have to take small decisions without any human guidance, for example a self-driving car deciding whether to stop or speed up at a yellow light. In enterprise big data analytics, in my opinion, we don't need to look into reinforcement too much. There might be some case studies where you can use it, but for normal big data analytics, reinforcement is at the extreme end.

And it's very easy to do all this in Python. You can see that the first example here is linear regression, which is a supervised machine learning algorithm, in just a few lines of code, maybe ten.
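The slide code isn't reproduced here, so here's a minimal scikit-learn sketch in the same spirit; the features and prices are invented purely for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented features: [bedrooms, bathrooms, floor area (m2), has garage]
    X = np.array([
        [3, 1, 110, 1],
        [4, 2, 180, 1],
        [2, 1,  75, 0],
        [5, 3, 240, 1],
    ])
    y = np.array([420000, 690000, 310000, 880000])   # made-up sale prices ($)

    model = LinearRegression().fit(X, y)

    # Predict the price of a new, unseen house from its features.
    new_house = [[3, 2, 140, 1]]
    print(model.predict(new_house))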
The next one is also supervised: it's called logistic regression. It's a kind of supervised machine learning algorithm, but here you don't predict a continuous outcome like a house price; instead you predict a classification: yes or no, or which digit a photo of a handwritten digit shows (one, two, three, and so on). An example would be predicting whether a user account has been hacked or not, based on the click-stream data. That's logistic regression. We also have K-means here, which is an unsupervised machine learning algorithm, and you can see it's just eight lines, four of which are common with the previous examples; so four lines of code, and it divides my data into clusters. But remember: we first have to prepare the data before we can do data science. And this is all thanks to the package called sklearn; you can also use TensorFlow, and there are other packages as well, which already have all the mathematics done for you.

Data scientists are the ones responsible for, and educated in, fine-tuning these algorithms to improve their predictive power; that's what data scientists are. Data engineers are those who actually capture the data from the different data sources and prepare it for data science. And that's why, because of big data, data scientists are mathematicians trying to learn programming and programmers trying to learn statistics and mathematics; the two disciplines are merging together.

So I'll conclude by saying that doing big data right is all about asking the right question; then figuring out whether you need a big data solution or not, using the analytics continuum; and, if you do need one, identifying all the different data points the data will flow from. You have Python packages to manage those different data sources; you do a lot of data crunching with Python; you prepare the data for data science; again using Python, you apply data science; and you have a solution. Deploy that solution. That's it. Thank you.

[Host] We do have a few minutes for questions, if anyone would like to ask. Going once... yes?

[Audience member] Thanks for the talk, that was really informative. I'd like to get into data science, and we do have some good, substantial amounts of data...

[Kripesh] I was in the same boat. That's actually a really good question. I believe that if you want to learn data science, the first thing you can start with is machine learning, because that's what data science is all about: applying those algorithms. There's a lot you can learn in machine learning, different algorithms, so you start with machine learning and then move towards deep learning. That's one discipline. But if you want to be the complete package, also start learning the data engineering aspect of it, and start with MapReduce. There are a lot of different technologies now, Spark and heaps of others, but if you start with MapReduce it gives you a good journey, from one step to the next, because MapReduce is where it all started.
You know, it was the first product that came out of Yahoo research. So you understand MapReduce, and then you can go further. Then you learn Python, the different packages of Python: how to scrape websites, how to load data from the Twitter API or the Facebook API. Then you have to learn about managing JSON and XML files in Python. You can also learn some NoSQL, like MongoDB or DynamoDB, because you have to manage different data sources. Another thing you can learn is how to do image manipulation in Python. So: machine learning to go towards data science, and the data engineering aspect, to be the complete package.

[Host] Any other questions? Going once... going twice. No? Please join me in another round of applause.