Hi everyone, I am Rajesh. I have been working on Hadoop-related platforms for the last three or four years, and very recently I started concentrating on Hive as well as Tez. For today's agenda, I am going to concentrate, at a very high level, on Apache Pig. Before we get started: how many of you have worked on Hadoop, are aware of Hadoop, or are planning to work on Hadoop? For today's session I will cover a very high-level introduction to Hadoop, especially MapReduce as well as HDFS, and then we will move on to an introduction to Pig with some working examples. I assure you I will not get into the API-level details of MapReduce, because that gets a little more complex; I will keep it at a high level. Before starting the session, let us cover some basic concepts with the world-famous word count example. How many of you are aware of the word count example? It is a fairly simple use case: you have a large file to be processed, and all we need to do is find the frequency of each word in that file. But we will try to cover as many concepts as possible with this one use case. For the time being, consider that you have a 100 GB file; it could be a text file or any other file, and all we need to do is some processing on top of it. What would be the first approach? If you have only one machine and you are a programmer who is comfortable with shell scripting, we will start off with that. Is this visible? Not visible? A little blurred? I will share all these slides anyway. So I have a file called largefile.txt; just for demonstration purposes I copied the entire Node.js documentation into this file. It has got a bunch of text, and all we are trying to do is count the words in it. Let us first replace all the whitespace with newline characters. Once this is done, we will try to count the unique words in this text file, and it starts printing the count and the corresponding word. But there is a problem with this approach: uniq -c verifies uniqueness only by comparing adjacent lines. So what ends up happening is that you see a "1 =" over here, meaning the equals sign has appeared once, and then the same equals sign shows up again further down. You want to avoid that situation, so the quick and dirty trick is to sort the data and then count. So before the uniq command, let me sort it and then run uniq again. Now you start getting the result: the double quote has occurred 32 times, the double quote followed by a comma has occurred 16 times, and so on. And if you want to order it further, let us sort by the first field, the count. Now it prints in ascending order; there is a blank token which has occurred around 16,000 times. So this is a very quick and dirty way of figuring out the word frequency in a large file sitting on my laptop.
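For reference, the commands run in that demo were roughly the following; the exact flags are from memory, so treat this as a sketch rather than the verbatim session.

```sh
# Replace whitespace with newlines, then count "unique" words -- wrong, because
# uniq only collapses adjacent duplicates:
tr -s ' ' '\n' < largefile.txt | uniq -c

# Quick and dirty fix: sort first so identical words become adjacent, then count:
tr -s ' ' '\n' < largefile.txt | sort | uniq -c

# Finally, order by the first field (the count) to see the least/most frequent words:
tr -s ' ' '\n' < largefile.txt | sort | uniq -c | sort -n
```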
But if you have to do the same thing for a 100 GB file, it is going to consume a lot of memory as well as CPU, and you have to read lots and lots of data from disk too. Just to give some perspective, let us look at one more file. The first file is almost 800 KB in size, but let us try to process another file which is around 13 MB, and this will hang for a long time, because it spends too much time in the sorting itself. So the basic point is that we will not be able to scale with just a single machine. Obviously we can scale up by adding more CPU and memory, but that is going to be hard; it has its own limits. So what can be the second solution? Let us say management gives us some ten additional machines. Is it possible to solve it in a better fashion? Probably yes. What we are going to do is split the 100 GB file into 1 GB chunks, so there can be a hundred 1 GB chunk files which can be processed by the various machines. Let us apply the same algorithm that we wrote on each of these smaller chunks and let each machine write its own output to an intermediate file. Then we can have one more machine which reads all these intermediate outputs and aggregates the data to generate the final output; we will sketch this out in a moment. The advantage of this approach is that instead of scanning the 100 GB file sequentially, we are processing it in parallel. But there are a couple of issues associated with solution number 2. Can you point out some of them? Do you find any issues with this solution? Yeah, exactly. The first problem is: who is going to be responsible for logically or physically splitting the 100 GB file and then handing the pieces to all the ten machines? That is problem number one. The second problem is: what happens if the data copied onto one of the machines is lost, or the disk gets corrupted? We will have to ensure that the data is replicated across different machines, so data replication is another problem. Similarly, machines are bound to fail; there can be failures in CPU, in memory, in the network and so on, so we will have to deal with fault tolerance. And let us say management approves another ten machines: how do we add them without bringing down any of the machines or any of the processes running on them? There are two approaches to scale. One is to scale up, by adding more processors or memory to the same machine. The second is to scale out: you add as many machines as possible and grow horizontally. So how do we scale out without bringing down any of the processes that are running? And since multiple programs are running, in order to get the aggregated data, the whole set of programs has to complete first.
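Before we go into who coordinates all of this, here is a rough sketch of what solution number 2 amounts to, simulated on a single machine for illustration; the file names and the 1 GB chunk size are just placeholders.

```sh
# Split the big file into ~1 GB pieces (in reality each piece goes to a different machine).
split -b 1G largefile.txt chunk_

# Run the same word-count pipeline on every chunk, each producing a partial result.
for f in chunk_*; do
  tr -s ' ' '\n' < "$f" | sort | uniq -c > "$f.counts" &
done
wait

# One "aggregator" machine merges the partial counts into the final answer.
cat chunk_*.counts | awk '{ total[$2] += $1 } END { for (w in total) print total[w], w }' | sort -rn
```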
So how do we ensure, or who is going to take care of figuring out, that all the programs are complete and now we can go ahead and start the aggregation? There has to be some sort of coordination. And this is not the only program that will run on this cluster forever; management would have given those machines for other work as well. So if additional programs start running on these machines, how do we make sure resource management happens properly? How do we ensure that, say, 10% of the resources go to this particular program because it is a production program, and how do we ensure concurrency is handled properly? All these issues crop up the moment we start writing a distributed application; you basically need support for all these features. We can no longer write a simple shell script and say it is going to work. If you have to distribute it, you have to take care of all of this. So with distributed processing, we have to take care of concurrency, scalability, data replication, splitting the input data, resource management; a lot of things come into the picture. But notice that we started off solving one problem, counting the frequency of words in a large file, and soon we digressed into solving fault tolerance, data replication and so on. In a nutshell, writing a distributed application is really, really hard, because you have to take care of all these features. And that is the reason for having Hadoop. So what is Hadoop? It is an open source platform for processing large volumes of data. The goals are fairly simple. First is scalability: it should be able to store petabytes of data and mine petabytes of information on top of that. Second, it should be cost effective: the moment you start scaling up vertically, the cost becomes very high; the moment you go beyond, say, 60 or 100 cores, the cost shoots up, so the solution should be able to run on commodity hardware. Third, it should be highly reliable: the petabytes of data you store in Hadoop should be safe. Think of an example where there are a million files sitting on Hadoop and one fine morning some ten files have vanished or got corrupted. The first thing management is going to say is: we lost ten files, fine, but do we need to store a copy somewhere else? Because today it is ten files, tomorrow it could be ten thousand. So it should be highly reliable. These are some of the main goals of Hadoop. And it should be fairly easy to create a distributed application on top of it: let the programmer concentrate on the business logic alone, not on building a fault tolerant system.
Let him not worry about scalability and so on. So what are the major components of Hadoop? There are two: one for storing the data, which is HDFS, the Hadoop Distributed File System, and the second is MapReduce, the distributed processing framework which sits on top of it. Any questions so far? Please feel free to ask at any point during the presentation. This is the high-level logical overview of Hadoop. As I mentioned, there are two major components, HDFS and MapReduce. HDFS is the distributed file system: we throw all the data onto Hadoop and it gets saved there. It has a couple of components which we will discuss in the next slide, but primarily it has a name node, which is the master, and a bunch of data nodes where the data resides. In similar fashion, MapReduce is again a master-slave architecture: you have a job tracker, which is responsible for accepting all the jobs and doing the resource management, and a bunch of task trackers running on different machines. This is a logical view, not the physical view yet; within these two major components are the various sub-components. Now let us look at the physical view, which is a little different. Is this visible? In the physical view we consider a Hadoop node as the combination of a data node and a task tracker. As I mentioned, the data node is a process responsible for storing the data you throw at Hadoop, and the task tracker is responsible for the compute side. So every Hadoop node comprises two major processes, the data node and the task tracker. If you have to spin up Hadoop, you can either run it on a local machine, in which case you are mostly doing some debugging, or if you want a cluster you just keep increasing the number of Hadoop nodes, and on top of them the master processes are the name node and the job tracker. If you look at a realistic Hadoop cluster, you have lots of nodes sitting in racks: once you go beyond a certain size you cannot place everything in one place, so you end up putting some 20 or 40 machines on every rack. Each box in the rack represents a Hadoop node, which internally has its own data node and task tracker processes. All the nodes within a rack are interconnected by their own network switch, and if you want to scale out you just add as many racks as needed; the racks are interconnected by another network switch, or a couple of them. So this is the high-level overview of a Hadoop cluster. The name node, the job tracker and the clients can sit in one of these racks as well. Any questions on this? We will cover that in the next couple of slides. So what is HDFS? HDFS is one of the two major components of the Hadoop system.
It is a really scalable, self-healing and fault tolerant system. What do we mean by scalability here? We are not talking about GBs or TBs of data; we are talking about petabytes of data stored on commodity hardware. And what do we mean by self-healing? You store the data, but say a couple of machines go down or a couple of data blocks get corrupted; it should be able to heal itself and store another copy somewhere else. And it should be fault tolerant as well. It is not so different from any other file system in that you have the usual files and directory layout; even in HDFS you have the concept of files and directories, but Hadoop is suited for large files. So let us say you want to store a 1 GB file. Every large file gets chunked up into smaller blocks of 64 MB, 128 MB, 256 MB or 512 MB each, depending on the configuration you give. Taking the 1 GB file as the example: if your block size is 128 MB, it is going to create 8 blocks. The block is the basic unit of HDFS. It will end up creating those 8 blocks, and every block in that file will automatically be replicated three times by default, assuming you have a reasonably large cluster. This is to ensure you do not incur any data loss. One thing I missed out earlier is that you can also make Hadoop rack aware. What I mean by rack awareness is this: these are the various racks available in the cluster, and you can make the name node as well as the job tracker aware of which racks exist and which machines belong to which rack. One of the advantages is that if you have to place a block on three nodes and the cluster is rack aware, then when you copy a block to a particular node, the first copy is written locally, the second copy is automatically written to a node on one of the other racks (it can choose any rack in the cluster), and the third copy can be placed on any other machine. So even if an entire rack goes down, your data is still safe, because a copy is sitting on another rack and you will be able to recover it. Also, as part of writing the data, the HDFS libraries internally create checksums for the data being written, which helps protect against reading corrupted data. And on a periodic basis, the data node, which is responsible for storing all the block data, scans the data and verifies the checksums as well. If there are any problems, it informs the name node as part of the block report, and the problem gets fixed without the user even coming to know of it. And HDFS is a file system where you write once and read many times; it does not support updates.
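Before going one level deeper, this is roughly how you could see the blocking and replication from the command line on a running cluster; the paths are made up, but hdfs dfs -put, -setrep and hdfs fsck are standard commands.

```sh
# Copy a local 1 GB file into HDFS; with a 128 MB block size it becomes 8 blocks.
hdfs dfs -put largefile.txt /data/largefile.txt

# Ask for 3 replicas of every block of this file (3 is also the usual default).
hdfs dfs -setrep 3 /data/largefile.txt

# Show the blocks, their sizes and which data nodes hold each replica.
hdfs fsck /data/largefile.txt -files -blocks -locations
```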
So let us go one level deeper: what are the components of HDFS? As I mentioned earlier, there are two major components within HDFS: the name node and the data node. It works on a master-slave architecture, where the name node is the master and the rest of the nodes are the slaves. What is the responsibility of the name node? It maintains all the metadata about your files, directories and so on. It keeps this in memory and also writes it to its own log files. It also knows how many data nodes are available in the cluster at any given point in time. It maintains two primary data structures. Say you have a 1 GB file which has been chunked into eight blocks, each replicated three times on different machines; where are those blocks? First, it maintains a file-to-block mapping, which says: this is the file and these are the blocks belonging to it. The second data structure is the block-to-node mapping: this is block number one, it is present on three machines, and these are the machines. Those two data structures are the core of what the name node maintains. On top of that, if you have a secured cluster, you might want authorization enabled, where you say this user can access this directory and that user cannot access another directory; that sort of authorization is also handled by the name node. Yes, there is also a secondary name node, which I have not shown explicitly here, which keeps the filesystem images so that if the name node crashes you can bring the metadata back up. Earlier this was not an active-active setup: if the primary went down, you had to manually bring up the secondary name node by copying the images and so on. But these days you can have HA using a quorum-based journal manager: you have a primary name node and a standby name node, both writing to the quorum journal manager, and when one of them goes down the other automatically takes over. As for the data node, it acts as the slave. Whenever a data node comes up, it scans all the blocks it has on its disks and reports back to the name node, saying these are the blocks I have, and the name node lets it join the cluster. It maintains all the block data on its own set of disks; a typical configuration will have around 4 to 10 disks per node. It also periodically sends heartbeat and block report information back to the name node, saying these are the blocks I have, or these blocks got corrupted on my machine, start re-replicating them.
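Purely as a mental model of the two data structures just described (this is not the actual NameNode code, just a conceptual illustration in Java):

```java
import java.util.*;

// Conceptual sketch: the two mappings the name node keeps in memory.
class NameNodeMetadataSketch {
    // file path -> ordered list of block IDs ("this 1 GB file is these 8 blocks")
    Map<String, List<String>> fileToBlocks = new HashMap<>();

    // block ID -> data nodes holding a replica ("block 1 lives on these 3 machines")
    Map<String, Set<String>> blockToNodes = new HashMap<>();
}
```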
So let us walk through a read path and a write path on an HDFS cluster. Say we have a client who wants to write some data onto HDFS. The client starts using the API that HDFS provides. Say he wants to write a 1 GB file; the API contacts the name node saying, I want to write a 1 GB file, where do I write it? The name node in return says: this is the machine where you can start writing. The client then connects to the appropriate data node directly and starts writing the blocks; in this example B5 represents block number 5. The client starts writing the data, and the advantage is that, behind the scenes, without the user even knowing, the block is replicated to other machines: block B5 automatically gets replicated to a second data node and a third data node. Once the whole pipeline completes, success is returned to the client; the write happens in a pipeline fashion. On the read path, if the client wants to read a file, it again contacts the name node: this is my file, where do I find its blocks? The name node returns the blocks, and each block carries the data node information. Based on that, the client, using the HDFS API, connects to the appropriate data node; in this case it is reading block B2. That is how the read path works in HDFS. There is one more thing I mentioned: what if a couple of blocks get corrupted? In this case block B3 got corrupted, and this is where self-healing comes in. The name node recognizes it and issues a copy: block B3 is still available on another machine, so it automatically gets replicated to yet another machine. The B3 which was classified as corrupt on one node is replicated onto another machine without the user even knowing. This way HDFS is self-healing and fault tolerant: even if a couple of machines go down, the full data is still retained. Any questions on this? Why is the replication synchronous? It needs to ensure that the data is actually written three times. If it were asynchronous and this machine went down in the middle, the name node would not even be aware of the data flowing through, because the data does not go through the name node; data goes directly to the data nodes, and the name node is contacted only for metadata. So if it were asynchronous and a couple of machines went down, you would be left with only one replica, and by the time you start re-replicating, even that node could go down. The replication factor is completely configurable: in a typical production environment it is set to three, but on your local desktop it is usually set to one, with a single data node. And if you set the replication to one, reads can only be serviced by that particular data node. Say this is a Hadoop node.
The Hadoop node has two components, a data node and a task tracker. Now suppose the block is available only on one node but the task that needs it is running somewhere else; in that case it has to do a remote read, because the data can be served only by that one node, there being only one copy. So that was the success scenario; which part do you want to go through again? Okay. Let us say a client wants to write a movie file which is almost 2 GB in size, and the block size has been configured as 256 MB, so you again need eight blocks to be written. As a client you cannot contact any of the data nodes directly; you contact the name node first and say, this is the movie file I need to write, where can I write it? The name node already has data about all the nodes in the cluster; it also knows which nodes are heavily utilized and which are not. It is quite possible that when you add new nodes their disks are completely empty, in which case it will prefer those new nodes and tell you to write there. Then the client, using the HDFS API, just as before, contacts the data node it has been assigned. Yes, via the HDFS API; yes, that is handled at the API level itself. The user is not aware of it; the client will retry a couple of times and if it still cannot succeed, the API itself throws an exception. The user layer is something like your command line: as a user you run some commands to store the data, or it can be via your Java program, or Python, and so on. So let us start looking at MapReduce. MapReduce is the distributed processing framework on Hadoop; it is basically the processing half of Hadoop. It abstracts out most of the complexities we saw in the earlier example: fault tolerance, scalability, all the concurrency issues. All of that is abstracted away, and the user can concentrate only on the business logic. It is based on the well-known divide and conquer approach, and as an end user you need to implement only two functions: one is map and the second is reduce. We will see that in the next couple of slides. The bottom line is that it lets the developer concentrate only on the core functions; you do not need to worry about all the other complexities of writing a distributed application. So instead of going straight through the MapReduce API, I thought we could use a simple analogy which is easier to understand. Let us say we want to set up a simple juice factory. What do we need to produce? The ultimate aim is to process various fruits and package the juice into bottles: grape juice, orange juice, pineapple juice and so on. What is the input? A container full of fruits. All that we do is take the fruits, split them into smaller baskets, and we have a front line of workers.
We give the baskets to them for cutting and chopping. Every worker in the front line gets a basket of fruits, and based on the type of fruit he chops it accordingly and passes it to the next layer. The next layer could be a set of conveyor belts, grouped so that each belt carries the chopped fruit to the appropriate grinder. So if you have cut apples here and this is the apple juice grinder, the conveyor belt transfers all the apple pieces from all the front-line workers into this particular grinder; the pieces come from here, from here, from here and so on. The same thing happens for pineapples and oranges. Once the pieces arrive, the responsibility of the grinder is just to crush them, get the juice out and package it. Is this understandable? Now, if you apply the MapReduce paradigm here, the cutting of the fruits is nothing but your map phase. You are not bothered about the container; someone gives you a basket of fruits, and the responsibility of the front-line worker is just to chop. That is the map phase, which transforms the input (k1, v1) into some intermediate output: in this case the input was a fruit and the output is the cut pieces. The middle piece, the conveyor belt, is the shuffle phase, where you group all the apple pieces together and transfer them to the appropriate grinder. And the last part, the grinding, is the reduce phase. Is this understandable? That is the mapping logic; and yes, the logic can differ per fruit: for oranges you might want to remove the seeds, for pineapple you might cut differently. The logic may vary based on the input, but that is embedded in the map function itself. A quick question: what happens if the grouping goes wrong? Suppose instead of sending apples to this grinder, one worker sends apples and another cannot group properly and sends oranges; what happens? You get a mixed fruit juice, and everything goes for a toss. Yes, exactly, that is the partitioner. In the MapReduce world the partitioner is responsible for saying, this is the output and it should go to this particular reducer. There are various partitioners: there is a hash-based partitioner, or you can write your own partitioning algorithm; it sits roughly here in the picture. And if you have a different product where you do want mixed fruit juice, you might have a partitioner that says, pick ten apple pieces and certain ratios of the other fruits; that is perfectly valid. Now let us look at it from the technical perspective, again not too technical, in very simple terms.
All that a mapper does is take a key and a value, process them, and return another key-value pair. The reducer takes a key, K2, along with a bunch of values, processes them and returns the final output. Now let us apply that to the same analogy: the basket number and the fruit type are the input key-value pair. The map does something, chopping the fruit in whatever way is appropriate, and the output of the mapper is a fruit tag (it could be an apple tag, orange, pineapple and so on) and the value is the cut piece or the piece number; that is the mapper's output. By the time the reducer runs, all the conveyor belts have delivered their pieces, and the reducer simply gets a fruit tag (apple, orange, whatever it is) plus a bunch of pieces that need to be crushed; the reducer does the grinding work and produces the output, which is again the bottle number and the juice. Now let us look at the components of MapReduce. MapReduce again has a master-slave architecture, where the master is the job tracker and the slave is the task tracker. What is the purpose of the job tracker? It accepts all the jobs from the clients; that is its primary function, managing the jobs submitted by the various users. Along with that, it is also responsible for splitting a job into tasks and handing them to the appropriate set of task trackers. There is also a concept called data locality in Hadoop: the compute is taken to where the data resides, as opposed to other architectures. In a typical distributed architecture, most of the data is stored on a SAN and you have a bunch of compute servers; if you have to mine the data, the data travels all the way from the SAN disks to the compute machines before you can start mining it. In MapReduce, by contrast, you take the computation that needs to run, the algorithm, and bring it down to the node that holds the data locally, and run it there. For instance, going back to the earlier picture, if this Hadoop node has the data, instead of transferring the data to some other node and running the computation there, it is much easier to bring the algorithm to this node and run it locally; that saves a lot of network bandwidth. Who coordinates that? The job tracker is responsible for scheduling the tasks in such a way that you get good data locality; that is one of its primary features. The task tracker, on the other hand, is responsible just for running the work given to it: it has a bunch of tasks, map tasks as well as reduce tasks; it runs them, then stores or sends back the results. That is the purpose of the task tracker.
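To make the map (k1, v1) to (k2, v2) and reduce (k2, list of v2) signatures just described concrete, here is a minimal word count sketch using the standard Hadoop Java API; the class names and tokenization are illustrative, not the exact code from the slides.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: (byte offset, line of text) -> (word, 1) for every word in the line
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit the intermediate (k2, v2)
                }
            }
        }
    }

    // reduce: (word, [1, 1, 1, ...]) -> (word, total count)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```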
Now let us look at MapReduce execution at a high level. This is a client node, which has the job driver and the APIs required for submitting the job to the cluster. It submits a job to the job tracker, and the job tracker hands the tasks to the appropriate set of task trackers; as part of scheduling it also tries to maintain data locality. A task tracker internally has various slots; as I mentioned, it can have a number of map slots and reduce slots. This is configurable in the earlier versions of MapReduce: based on the number of CPUs available, you split them into x map slots and y reduce slots. To give an example, if you have a 12-core CPU which is hyperthreaded, giving 24 logical cores in total, you might end up configuring around 14 map slots and 6 reduce slots per machine. So every task tracker can spin up 14 map tasks and 6 reduce tasks, and based on the requirements of the jobs the job tracker assigns tasks to the respective task trackers. Now, what is the purpose of the scheduler here? The client is submitting the job and the job tracker is assigning the tasks, so why a scheduler? As I mentioned earlier, this is not the only job that runs on the cluster; it is a multi-tenant system. Lots of users come in with their own requirements: some jobs are production jobs, some are exploratory, some are batch and so on. You need to give an appropriate share to each of the jobs, and that is where the various schedulers come into the picture; they ensure the appropriate shares are given to each job. Several schedulers are available in Hadoop: the capacity scheduler, the fair scheduler, and the default scheduler that ships with the Hadoop distribution. Any questions so far? Now, what are the limitations of the existing MapReduce, what I call version 1? One major thing is scalability once you go beyond about 4000 nodes per cluster. As you have seen, the job tracker is a single process; if you have 4000 nodes, all 4000 have to send heartbeats to the job tracker, and at the same time the job tracker is also accepting all the jobs from the clients. So it is doing double duty: accepting jobs from clients while also handling all the tasks that need to be scheduled. It becomes a choke point once the cluster grows beyond roughly 4000 nodes. Theoretically there is no limit, but beyond that scale the cluster will not be well utilized: you can have 10,000 machines in a single cluster but they may not be used fully, so utilization problems start cropping up. The second limitation is availability: the job tracker is a single point of failure in that version of MapReduce. If the job tracker goes down, you have to restart all the jobs that were running in the cluster.
And every node, as I mentioned, has a fixed set of slots. If you have 24 cores per machine you end up configuring 14 map slots and 6 reduce slots, and that split cannot be varied. So even if you submit a job which needs a lot of map slots, you will not be able to use the reduce slots already allocated on those machines. Those are some of the limitations of the previous version of MapReduce, and that is how the next version came up, which is YARN, also called next-gen MapReduce. YARN stands for Yet Another Resource Negotiator. I will not dig into much detail on the YARN architecture, but at a high level they split the job tracker into two major pieces. One is the resource manager, which is responsible only for handling the resources in the cluster. The second is the application master. The job tracker's responsibility earlier was to accept jobs from the various clients and also handle the resource requests from the cluster; that responsibility has now been split. The resource manager handles only the nodes in the cluster and the resource requests, and for every job you submit, an app master is spun up. The app master is responsible only for monitoring the job that a single client has submitted, so it does not impact any other job in the cluster. Say four users submit jobs: in the earlier model, if the job tracker went down, all four users were impacted. Now, for those four jobs, four different app masters are spun up, and if one app master goes down, only that particular user is impacted, not the rest. The resource manager is basically responsible for allocating containers for the various applications, and the node manager is very similar to your task tracker; it manages the various tasks within the node. The node manager spins up containers where your tasks run and returns the results back. And the closest analogy for the app master is a per-application job tracker: if four users submit jobs, you end up with four different app masters. So let us look at some of the deployment patterns and use cases of Hadoop. In most companies, this is how the traditional data architecture looks: you have data gathered from various products, from CRM applications and lots of other sources. You take data from all the traditional sources, and to gain insights you transfer it onto an RDBMS or data warehousing platform, or an MPP platform like Netezza or Teradata, do all the data mining in those traditional repositories, and then expose the reports to the upper layer. You can have business analytics or enterprise applications using the data or the reports generated from those traditional platforms. So where can Hadoop fit in this architecture? Apart from these traditional sources, these days we have started getting lots of different kinds of logs as well.
Earlier we pretty much ignored the web logs, email logs, sensor data, customer care data, all the social media data, network data; lots of data kept coming in, and it was not possible to mine most of it. Now, with the introduction of Hadoop, you can take pretty much all of that data, store it on Hadoop, and mine it, be it structured, unstructured or semi-structured, mine it all together to extract meaningful information, and then transfer the mined data onto the traditional platforms. So it is not a competing technology, it is more of a complementary technology in this case. You are not replacing any of the existing systems; you are adding a Hadoop cluster, offloading most of the ETL-type jobs there, taking a lot of data, extracting meaningful information out of it and then storing it back in the traditional repositories. That is one use case, the batch processing use case. The second pattern is data exploration. Some data scientists are more comfortable looking at the raw data itself, but they cannot do that on traditional platforms, because those platforms cannot handle unstructured or semi-structured data. In those cases you put all the data onto Hadoop, and some of the higher-layer applications can connect to the Hadoop system directly; MicroStrategy is one example, Talend is another. They can connect to Hadoop, run some reports and get the results out. The third pattern is data enrichment. You take all the sources and put the data on Hadoop, but say you want to serve front-end applications much faster. A classic example: you have all the customer-related information, and whenever the customer comes to the page you want to build a customer profile and say, these are the products you bought last time, these are the pages you visited last time; and it has to happen on a sub-second basis. In such use cases you add a NoSQL store; it could be Cassandra, HBase or any of those NoSQL technologies. You mine the data on Hadoop and store the results in the NoSQL platform, which can then serve the higher-level applications. A recommendation engine is one of the classic examples. So these are the various usage patterns for Hadoop. What is clickstream processing; how many of you know clickstream? Okay, clickstream is this: as and when you browse through websites, logs get generated. You can capture those logs and start mining them to understand customer behaviour, what kind of pages the customer has browsed through and so on. And conversion funnel analysis is something like this: say there are ten different steps on the website before a visitor gets converted into a customer, and some visitors drop off at the third or fourth page; you want to mine that information. So you can use Hadoop for that as well, basically for user behaviour analysis. Risk modelling is predominantly used in the financial industry.
ETL offload we have seen: basically taking the heavy lifting off the data warehouse, doing most of the work on the Hadoop platform itself and then sending the results back to the data warehousing platforms. Filtering spam messages, this again is a Yahoo use case. Recommendation engines, and mining manufacturing defects, and so on. For instance, Western Digital, the hard disk manufacturing company: earlier they were not able to store all their sensor information for mining. With the introduction of Hadoop they are able to store all of it in HBase, which in turn stores the data on Hadoop, and they are able to get meaningful information out of it. Any questions so far? That brings us to the end of the high-level introduction to Hadoop. Any questions on the Hadoop side of things? I have not covered most of the API-related details. To the question on the shuffle: this is where you group the data and send it to the appropriate reducers. In the analogy, the conveyor belt pushes the data to the appropriate grinders; but in the real MapReduce world, the reducers are responsible for asking all the map tasks, hey, do you have data for me, and they pull the data. For the analogy I just kept it as a conveyor belt, which is the classic picture of the shuffle. And in real MapReduce the sorting is tied to the reducer phase as well: the output of a map is always sorted. The reason is that the reducer also needs to do grouping; if you look at our shell script example, just using uniq -c does not work, so the trick is to sort the data and then group it. That is one of the reasons sorting is enabled here: the reducer needs sorted input. If you hand the reducer too much unsorted data at once, that will bring it down; so you give it smaller chunks which are already sorted, and the reducer merges them in a merge-sort fashion, which makes the sorting on the reducer side much faster. That part is not depicted in the analogy. So this is where the data is handed to the reducers: a reducer gets a tag, and the bunch of pieces could come from different mappers; piece one could be from mapper one, piece two from mapper X, but the reducer does not know that. It just gets a key and a list of values to process, and starts working on them. And no, this is not original to Hadoop; it is based on a Google paper. There were other approaches, HPCC computing for instance, but this one is predominantly used because it has proven scalable, and it can be instantiated in the cloud as well: you can spin up a bunch of nodes on Amazon EC2 and install Hadoop on top of them. HPCC is one alternative, OpenMPI is another, but those have their own drawbacks: HPCC is really hard to code and debug, and the MPI-based approaches have the same problem. And no, the mapper does not do the splitting; the client submits a job.
That depends on the input format you are going to use, and it is done on the client side itself. The basic question was: who does the job of splitting the work, of deciding how many tasks are required for this particular job? Again, say you want to process some 2 GB of data. For every piece of data to be read there is something called an input format in Hadoop. The input format understands how to read the data and how to split it, and it creates the appropriate number of chunks. Your input format for the movie file can say, I need to create 8 splits; typically a split follows the block boundary, so it ends up creating 8 splits, but the programmer has full control over how many splits get generated. You could say, I have a large cluster, so instead of 8 splits create 16 splits. Creating the input splits happens on the client side: the client invokes the appropriate API, the API creates the appropriate number of splits, and that gets submitted to the job tracker. Yes, that is correct; and yes, that is one of the drawbacks. But that is changing very quickly, and that is why YARN came up. Before the introduction of YARN, Hadoop was used only as a batch-oriented system, but YARN has changed the paradigm completely. And no, this one is not from Google: Yet Another Resource Negotiator. With the introduction of YARN, MapReduce itself becomes a pluggable component. You can have Storm, which is another processing framework for real-time feeds, or Kafka, or other platforms, all pluggable onto the same infrastructure; you can have MapReduce as well as the real-time processing frameworks sitting on the same cluster. That became possible with YARN, and it is completely open source. Okay, so let us move on to the introduction to Pig. Before we get started on this, I hope everyone got the Pig distribution, the .gz file; let me know if any of you have not got it. Just extract that .gz file; we will use the local mode, the local system itself, to run some of the examples. In the introduction to Pig I will cover these topics: a high-level introduction to Pig, what it is, what Pig Latin is; we will walk through some examples, use a few of the built-in functions, and talk about some debugging strategies. So what is Pig? It is basically a data flow language for processing large volumes of data. If you have to write a custom MapReduce program, it is really hard: even though it is just a simple map function and a reduce function, you have to go through lots of APIs, you have to compile the code, deploy it, and if you have to change the logic you have to rewrite the whole thing and deploy it again. That approach has its advantages and disadvantages, but that is why Pig was created: to abstract out some of the complexity involved in writing MapReduce code.
So it is basically a data flow language; the user does not need to worry about the complex APIs embedded in MapReduce. Who uses it? Since it is a data flow language, it is predominantly used for ETL use cases: extract the data, transform it and load it into some other system. Most data engineers also use it, because they want to play around with raw data whose schema they do not know, mine it, and try to extract some meaningful information from it. So why Pig, why not another SQL-like thing? The comparison between SQL and Pig we will cover in later slides, but writing MapReduce is really complex. I am not sure whether this is visible, but even to create a simple program that converts text from lower case to upper case, you need to write a lot of code; this is just a snapshot of it, not even the whole thing, but it gives a brief idea of what needs to be written. In the case of Pig, however (and I know this slide is completely not visible), you just need to write three lines of code to do the same thing: you load the data using one of the available load functions, say from a .tsv or .csv file, invoke the upper or lower function on the appropriate field, and then store it somewhere else; a sketch of those three lines follows below. What are the advantages? Much, much less code to debug: instead of debugging that Java program I am much more comfortable debugging three lines of code. Second, I do not need to know the complex MapReduce APIs and I do not need to compile anything: it is script based, so you can save the script and run it any time you want, and it becomes a lot more reusable; you can create functions and hand them to someone else, and even non-Java programmers can start using this scripting language. And the Pig engine has lots of rule-based optimizers: if it can optimize on the fly it will. For instance, instead of launching three different MapReduce jobs, it can figure out that a particular script can be collapsed, and it will launch only one MapReduce job. Even without the user being aware of it, it optimizes a lot of the work on your behalf. So where is it installed in the cluster? We saw that a Hadoop cluster has a bunch of racks and so on; do we need to install Pig on all the machines? Probably not. It can be installed on the client machine, because it is a client-side library. You just need to copy that pig .tar.gz to any client machine which has access to the Hadoop cluster and start running it. You put it on the client machine and start writing your scripts; when you submit a job, the Pig engine consumes the script, analyzes it, compiles it, converts it into MapReduce jobs on your behalf, submits them to the cluster and gets back the results.
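Since that slide was not visible, the three lines mentioned above would look roughly like this; the file name and field name are assumptions, and UPPER is one of Pig's built-in functions.

```pig
-- load a one-column text file, upper-case each line, store the result
data  = LOAD 'data.tsv' AS (line:chararray);
upper = FOREACH data GENERATE UPPER(line);
STORE upper INTO 'upper_out';
```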
So what are the major differences between SQL and Pig? Why Pig and not SQL itself? First and foremost, SQL is a declarative language, whereas Pig is procedural. What do we mean by declarative? In declarative programming we only state what needs to be done. Take SQL: you say, join these two tables on this key and give me the results. You are not telling SQL how to read the table or what algorithm to apply for the join; you state what needs to be done at a higher level and you are not bothered about the algorithms used. In a procedural language you spell out the steps: load this data, filter it, apply this transformation, do something else, apply this kind of join. All of that is written out explicitly. Second, in SQL you need a defined schema; without a schema you cannot load a table. In Pig the schema is optional: if you have one, well and good, but you can process the data without a schema as well, which makes Pig a lot more comfortable to work with. Also, in SQL you generally tend to write complex nested queries, and to read them back you have to start from the middle of the statement, because that is where the core logic sits; that gets complex, whereas Pig is a sequence of script steps and can be modularized, so it is easier to work with. And predominantly, people use SQL for ad hoc analysis, whereas Pig is used mainly for ETL, extraction and transformation. Now, what are the components within Pig? There are three major ones. One is Pig Latin, which is the scripting language itself. The second is Grunt, the command-line interface (CLI) with which you can interact with Pig in local mode or against a remote cluster. The third is Piggybank, a bunch of user-defined functions which is shipped along with Pig. And what are the modes of running Pig? There are four or five. First is local mode: as a developer you would like to try things on the local machine first rather than directly on the cluster, so you use local mode for development. The second is remote mode: once you have written the script and tried it on the local machine, you run it against the real cluster. Third is the Grunt mode, which is just the interactive command-line shell, and the script mode is where you have the script and execute it directly without the Grunt shell. The fifth is the embedded mode.
Some people want to run their scripts programmatically: they do not want to launch the Grunt shell or submit a Pig script; they want some control over the execution, and in that case they use the embedded mode. For today's session we will use the local mode. I hope everyone has copied the file and extracted it. Once you extract it, that folder becomes your Pig home; within the Pig folder you will have a bin directory, and from the bin directory you just need to invoke pig -x local to launch the CLI, the Grunt shell, in interactive mode. Let me know if it works out. Can I repeat that? No, it does not use brew; it is just a bunch of libraries. The pig tar.gz just needs to be extracted into any folder on your machine; it is a client-side set of libraries, and you invoke the command line shell to submit jobs. You are not installing anything, you are just copying it. I am not sure whether brew will work with Pig, I have not tried that. It is still in local mode. If you invoke pig alone it does not start in local mode; you have to invoke -x local, otherwise it launches in MapReduce mode, and it will look for some of the configuration files, core-site.xml, mapred-site.xml, hdfs-site.xml, and if it is not able to find them it is going to throw an exception. The moment you invoke pig -x local it says, I do not have a cluster environment, I am going to run everything locally. Any issues opening the shell? So let us go through some high-level information on Pig Latin. What is Pig Latin? It is a simple sequence of commands: process one set of records, transform it, and hand the result to the next step. It abstracts out most of the complexity of writing a MapReduce program, as we have seen already, and it is extremely easy to code and to share that code with someone else. What features are available in Pig Latin? It has a rich set of features; to write a data flow language you need a lot of them. You want to filter the data, join one data set with another, sort the data, group it, express complex types such as maps or tuples, write user-defined functions, support various input and output operations, and process data iteratively: for instance, for each record in a set of records, do some operation. Let us look at the data types available in Pig Latin. It has two kinds: scalar types and complex types. For scalars it supports most of the primitive types: int, long, float, double and so on.
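Before we continue with the types, here is a rough sketch of the invocations we just walked through, assuming a Pig 0.12-style tarball layout; the folder and script names are only placeholders.

    cd pig-0.12.0/bin
    ./pig -x local                  # Grunt shell in local mode, no Hadoop configuration needed
    ./pig                           # MapReduce mode, looks for core-site.xml, mapred-site.xml, hdfs-site.xml
    ./pig -x local myscript.pig     # script mode: run a saved script without the interactive shell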
Coming back to the types: starting with Pig 0.11 or 0.12, I am not too sure which, they started supporting boolean as a data type as well, and BigDecimal and BigInteger too. Oh, Cygwin is not installed? Does anyone have a copy of Cygwin? Okay. So how many were able to successfully open up the console? Only two of them. How many of you have Mac machines at least? Oh, even you are not able to open up the console; the Java path is not set. Do you have Java installed on your machine? Do you have JAVA_HOME set? Can you just set that: open up the pig file itself and change the Java home there. So how many of you have Cygwin installed? Are you able to open a different version? Do you have the Java home set properly, or is it not opening at all? Is it possible to share the laptop with others? In that directory just do a vi of the pig file; can I help out? Oh, you do not have Cygwin. That is hard, you will have to download it. You have internet, right? So is it possible to download it and set it up? You can try installing Cygwin if you have a Windows machine, or at least share the machine with the other folks. Are you able to open up? Okay. Oh, you have gone ahead; you do not need Hadoop for that. Are you invoking this particular command, pig -x local? Where is your JAVA_HOME? That is interesting; where else is it installed? Is it possible to copy the Java home setting from somewhere else? You have a pen drive, just copy it, sure. Looks like most of you do not have the Mac environment or Cygwin. So that is a Java version error, I believe; can you try to run it again, do a vi of pig? Oh, you have removed it already? Probably your Java version is not matching; which version of Java do you have? Can I try with that? Yeah. So how many of you have Cygwin, or is it possible to share the laptops at least, so that other people can see the screen? That should be more than enough. I do not think it is possible to set up Cygwin and start working on Windows right away; that is going to be tedious. So for the people who have it working, if they can share the system with someone else, it will be really helpful. How many of you have opened the shell? Okay, do you mind sharing it? That is more than enough. On this side only one has got it, right? Okay, that should be fine. Go ahead. So these are some of the common data types available in Pig. One is the scalar data which we talked about. The second is the complex types, and there are three of them: one is the tuple, the second is the bag, and the third is the map. A tuple is a collection of fields; think of it as a record in the database world, a collection of fields or columns, and it is always enclosed in parentheses, which is why it is marked in a different color here. A bag is a collection of tuples; think of it like a table in the database world. A bag will always hold a collection of tuples, and in this case it has two tuples within the braces.
It has Bangalore, India as one tuple and Washington, USA as another, and the bag relation is always wrapped in curly braces. A map is a collection of key-value pairs, the way it is represented in most programming languages, and it is enclosed in square brackets; that denotes the map type. So let us look at some examples. The first one is loading data into Pig: you have some data, how do you bring it into Pig? For today's example I have placed a flight data set in the Pig bin directory itself, so you will find 2008-sample.csv in the bin folder. The full data set is available on the site shown here, but since it is around 600 MB, I just took a sample of some 10,000 or 100,000 lines and put it as a sample CSV in the bin directory. So quit your Grunt shell, do an ls, and you should find 2008-sample.csv. It has basically this structure: year, month, day of the month, day of the week and so on; there are some 20 or 30 different fields to be loaded. So how do you load the data into Pig? There are a couple of ways. Pig provides an interface which can be implemented by different load functions; a few of them are PigStorage, AvroStorage, HBaseStorage, HCatStorage and so on. If you do not specify which storage should load your data, the default is PigStorage, and by default it uses a tab as its delimiter. But the sample we have is comma delimited because it is a CSV file. So these are the ways you can load the data. Say you have raw data and you just load ./2008-sample.csv without specifying anything: Pig interprets that PigStorage should load the data. In the second example you explicitly say that it is PigStorage, which is not much different from the first. The third one says it is PigStorage and the delimiter is a comma; if you do not specify the delimiter it defaults to tab, and if Pig cannot locate the fields, the entire record gets loaded into a single field. So in the third example you are instructing PigStorage that the file is comma delimited, and you can also optionally pass the schema: even though the sample data has 20 or 25 different fields, you can choose to project or process only three of them, the first three fields, which are year, month and day of the month, and you can specify the data type along with the schema; in this case all three are integers. Optionally you can also load all the fields available in the sample, so in the fourth or fifth example we load the entire record with all 20 or 30 fields and their data types.
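Putting those load variations side by side, here is a hedged sketch; the alias and field names are mine, and the file is assumed to sit in the current directory as described.

    -- default loader: PigStorage with a tab delimiter
    raw_data = LOAD '2008-sample.csv';
    -- the same thing, with the loader named explicitly
    raw_data = LOAD '2008-sample.csv' USING PigStorage();
    -- tell PigStorage the file is comma delimited
    raw_data = LOAD '2008-sample.csv' USING PigStorage(',');
    -- comma delimited, keeping only the first three fields and typing them
    raw_data = LOAD '2008-sample.csv' USING PigStorage(',') AS (year:int, month:int, day_of_month:int);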
So what happens if you have more than one file to be processed by Pig? Based on the features available in the load functions, it can optionally support glob or regex style loading. In this case we are instructing Pig to load all the CSVs matching this particular pattern, starting with 2008-sample and ending with, say, a or c, dot csv; it will pick up all the files that match the pattern and load them. Other examples cover loading complex structures: we saw loading simple data types, but how do we load the complex types, a tuple, a bag or a map? In the first example here you use PigStorage delimited by comma and specify that you are interested in the first three fields, year, month and day of the month, but in the next statement you want to take the year and month alone and group them into a tuple. You can invoke the built-in function TOTUPLE and convert year and month into a tuple which can be referenced in the later sections of the Pig script. Sorry, most of this is probably not visible; does it make sense to open up the Grunt shell and type some of the commands here, will that help? Okay, is this visible? Is this the same problem that you are also facing? It took a lot longer, not sure why it should take such a long time; no, this should not take so long. So all you do is load, let us say, the sample file using PigStorage, comma delimited, and say year is an integer, month is an integer and day of month is an integer. In the second case, all we are trying to do is build a tuple: you can say converted_to_tuple equals FOREACH raw_data, which iterates through each record in the raw data, GENERATE TOTUPLE of, say, year comma month. DESCRIBE is the feature with which you can see what schema a specific alias has, so in this case we print the schema of converted_to_tuple, and as you can see the fields got converted into a tuple, the tuple is shown inside parentheses, and the entire relation becomes a bag, which is why you have the curly braces. So that is one way to load a tuple. There are a couple of other ways with recent versions of Pig: starting with 0.11 or 0.12 you can simply put parentheses around the specific fields in the schema to indicate that they form a tuple; you do not have to explicitly invoke a tuple function. Let us look at the map structure. In the same bin directory you should also have a map.csv; the key and value in it are always separated by a hash, and you can load that field as a map complex data type by using map as the type in the schema and then start accessing its keys. In this case we load the map data and apply a filter: city equal to Chennai, where it is not null.
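A hedged sketch of those complex-type loads; the aliases, the glob pattern and the exact key name in the map file are assumptions on my part.

    -- a glob pattern loads every file that matches it
    raw_data  = LOAD '2008-sample*.csv' USING PigStorage(',') AS (year:int, month:int, day_of_month:int);
    -- wrap two of the fields into a tuple and look at the resulting schema
    converted = FOREACH raw_data GENERATE TOTUPLE(year, month);
    DESCRIBE converted;
    -- load a map typed field (key and value separated by # in the data) and filter on one of its keys
    map_data  = LOAD 'map.csv' AS (m:map[]);
    chennai   = FILTER map_data BY m#'city' IS NOT NULL AND m#'city' == 'Chennai';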
So you are creating a relation which loads all the map data and filters it on a certain condition, and you use DUMP to print the result onto the console. Let us quickly look at the dump feature. The DUMP command is used for dumping data onto the local console; it is not normally present in production Pig scripts, but if you want to quickly try something out and see the data on the console, you can use it. All it does is process the commands you have given and print the data. STORE is for storing the data onto Hadoop or to a local file system: we saw loading the data, and the other side of it is storing it, so after mining the data you issue a STORE of, say, the raw data into an output location. Either you can do that directly, or you can use PigStorage or other storage formats for storing as well, and you can specify the delimiter to be used; in this case I am using a comma as the delimiter and storing the data back to local storage. Let us look at some of the other operators. Pig by default supports lots of arithmetic as well as boolean operators, pretty much like most other languages: plus, minus, division and so on are all supported, it also supports the ternary operator, and it supports the boolean operators AND, OR, NOT and so on. One of the interesting comparison operators is MATCHES. Say you have data already loaded but you want to filter it based on a regex, a regular expression; in that case you use the MATCHES keyword. You load the data and then filter the raw data by, say, the day of the month matching '2.*'; this is just an example. It loads the data, applies that regex on the specific field and dumps the contents. Pig also supports a lot of relational operators; some of them are listed here and we will go through each in detail. What is FOREACH? If you have a set of records to be processed, you want to process them at the record level: for each record that we loaded, do some operations or transformations; that is where FOREACH is useful. In the first example we load the data and generate just the year out of it: for each record, project the year alone. Even though you loaded three fields in the LOAD statement, you are not going to carry all three through the script; you project early and say that for the rest of my script I just need the year field.
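A small hedged sketch of those pieces; the output path is my own placeholder, and the cast is my addition since MATCHES works on chararray, so an int field needs to be cast first.

    -- store a relation back out, comma delimited
    STORE raw_data INTO 'output_dir' USING PigStorage(',');
    -- filter with a regular expression via MATCHES
    twos = FILTER raw_data BY (chararray)day_of_month MATCHES '2.*';
    DUMP twos;
    -- project early: only the year flows into the rest of the script
    years = FOREACH raw_data GENERATE year;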
In the second FOREACH example on the slide, fields can also be addressed with the dollar operator, and the index starts at 0. We have three different fields, year, month and day of month, and each of them can be accessed with the dollar variable starting from 0: the moment you access $0 you get the year, $2 gives you the day of the month, and so on. You can also provide an alias for any field you have loaded: here you loaded the $1 field as the day of the month but want to rename it to month, and that is what we are doing. There is also an interesting range operator. Say you have loaded all the fields in the data but you are interested only in the first three fields and the last few; you do not want to write out each and every field. So in this example you have loaded all the fields into raw_data, some 10 or 20 fields, but for the projection I am interested in year, month and day of month plus the last few fields, the weather delay, security delay and so on. You do not need to write each one again: all you do is, for each record of raw_data, generate year through day of month, is this visible, and then, say, taxi_in through cancellation_code. Is the spelling correct? I guess the spelling was wrong there. In this case it is going to project all the fields starting with year up to day of month, plus the second range; when you say DESCRIBE on test it would have printed all of these, so you do not need to write everything again and again, and it is better for readability. Starting with 0.12 they introduced the CASE statement as well. There can be situations where, for each record you are processing, you want to go through a case statement; in this example I am printing the month name based on the month field, which is an integer, so we print whether it is January, February and so on, and it also supports an ELSE condition, similar to a default case. This is supported only in Pig 0.12. The next one is the FILTER operator, which is used for filtering out records early in the game. Some of the examples we have already seen, filtering on month equal to 0 and so on. Starting with Pig 0.12 they also introduced the IN operator in a filter: instead of saying filter the month which matches 0, or filter the month which matches 1 or 10, and so on, you can write an IN operator, filter the raw data on month, and inside the IN operator specify the list of months you want to process. This is available in Pig 0.12. It also supports conditional expressions, so you can filter on a condition based on a regex and so on. The LIMIT operator is fairly similar to what we have in SQL: you want to limit the number of rows that come out of the system, so you use LIMIT. Sometimes you say limit to 10 rows, and starting with 0.12 they also introduced a feature where you can specify an expression on the limit, a condition that has to be evaluated to work out the appropriate number of rows.
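Here is a hedged sketch of those projection and filtering forms; the aliases and field names (all_fields, taxi_in, cancellation_code and so on) are my guesses at the sample's schema, so treat them as placeholders.

    -- positional references: $0 is year, $2 is day_of_month
    by_position = FOREACH raw_data GENERATE $0, $2;
    -- rename a field while projecting
    renamed     = FOREACH raw_data GENERATE $1 AS month;
    -- range projection over a fully typed relation
    test        = FOREACH all_fields GENERATE year..day_of_month, taxi_in..cancellation_code;
    -- CASE (Pig 0.12): turn the month number into a name
    named       = FOREACH raw_data GENERATE CASE month WHEN 1 THEN 'January' WHEN 2 THEN 'February' ELSE 'Other' END;
    -- IN operator (Pig 0.12) inside a filter, and LIMIT to cap the rows
    q1_months   = FILTER raw_data BY month IN (1, 2, 3);
    top_rows    = LIMIT raw_data 10;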
SAMPLE and ORDER BY are also supported in Pig. Sampling is for when you have a large volume of data but want to sample, say, 10 percent of it, or randomly sample 1 percent of the data for development activities. For that you use the SAMPLE operator: you load the entire data set and say, sample the raw data at 10 percent; the value can vary from 0 to 1, which maps to 0 to 100 percent. ORDER BY is very similar to what you have in SQL: you can order on a single field or on multiple fields, and you can specify whether it should be ascending or descending. DISTINCT is for finding the unique records. If you are processing some data and want to find the unique flights in the data set, you load the data and create, say, unique_flights based on a DISTINCT over carrier and flight number; DISTINCT is the operator used for getting the unique records out of a data set. The GROUP operator is a little special here, because in the SQL world, the moment you group data it is usually sent to aggregation functions; in Pig that is not necessarily true. You can send it to an aggregate function, but it is not mandatory. When you do a GROUP BY, it groups on the relevant field but retains the original records as well. In this case you load the entire data and group by the flight number; what happens is that it creates a bag relation where the flight number is the key, and against that key you have the list of records. Let us see an example: describe the raw data, then group it by flight number, and when you do a DESCRIBE of the grouped relation you get the group key, and the whole thing is a bag relation. The field called group is nothing but the group key, since you grouped on the flight number, and next to it you get the entire raw data: if a flight has, say, 10 different records, all 10 will be present within that bag. You can also group by multiple fields; it is not mandatory to group by a single field. I would have liked to try an example, but unfortunately most of the people do not have Pig installed.
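A hedged sketch of those operators; again the field names (carrier, flight_num) are assumptions about the sample's schema.

    -- keep roughly 10 percent of the records for development
    sampled        = SAMPLE raw_data 0.1;
    -- order on one or more fields, ascending or descending
    ordered        = ORDER raw_data BY month ASC, day_of_month DESC;
    -- unique carrier and flight number combinations
    carriers       = FOREACH raw_data GENERATE carrier, flight_num;
    unique_flights = DISTINCT carriers;
    -- group by flight number: the result keeps the original records in a bag per key
    by_flight      = GROUP raw_data BY flight_num;
    DESCRIBE by_flight;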
Let us move on to the SPLIT operator. Say we have some data but it needs to be split, based on certain conditions, into four or five different streams; in that case we use SPLIT. In this example I load the sample data, and if the month falls between 0 and 4 the record goes into Q1, if it falls between 4 and 7 it goes into Q2, and so on, and you can store each of those variables separately into its own file. So basically you load the entire relation but split it and create several different streams out of it. All the standard joins are possible in Pig: inner join, outer join, left outer join, right outer join, full join; most of the joins supported in SQL are also supported in Pig. When you are joining you can join on a single key or on multiple keys; sometimes you might want to join only on the flight number, sometimes you might club a couple more keys together and join on those. Those features are possible, and you can join more than two relations or tables as well: currently we have only the flight table, but if you have four or five different tables it is possible to join all of them. This slide gives an example of the joins. It loads the raw data and projects only certain fields of it, the carrier, flight number, origin and distance, and there is another data set called airports, also present in the same bin directory, which has a mapping between the airport code and the actual airport name; basically we join these two data sets and get some insight out of them. In this case we load the airport data but filter it on the airport name matching Logan or San Francisco, so we are reducing the data that needs to be processed at every step: in the first case we loaded the entire data but reduced the scope by projecting early, and in the second we loaded the entire airports table but again reduced its scope to only two airports. Now you can do lots of joins. Here I do an inner join: you join the flight data you loaded, on origin, with the filtered airport set, on code; that becomes your default inner join. It is also possible to do a left outer join as well as a full outer or right outer join; all you need to do is mention that keyword right next to the relation. Let me quickly try a simple example here; okay, it is still loading, so we will revisit that example later. So what are the join strategies available in Pig? It has various strategies: one is the default join, which is a shuffle-based join. The second is the replicated join: you saw that the airports data set is really small, so you can load that small data set into memory and stream the larger data set past it, without shuffling it again and again; that is the replicated join.
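A hedged sketch of the split and join examples; the airports file layout and the field names are assumptions, and for a replicated join the small relation is listed last.

    -- split one relation into several streams on a condition
    SPLIT raw_data INTO q1 IF month < 4, q2 IF (month >= 4 AND month < 7);
    -- project the flight data early, then join it with a small airports lookup
    flights  = FOREACH raw_data GENERATE carrier, flight_num, origin, distance;
    airports = LOAD 'airports.csv' USING PigStorage(',') AS (code:chararray, name:chararray);
    filtered_airports = FILTER airports BY name MATCHES '.*Logan.*' OR name MATCHES '.*San Francisco.*';
    -- inner join by default; an outer or replicated join only needs an extra keyword
    joined      = JOIN flights BY origin, filtered_airports BY code;
    left_joined = JOIN flights BY origin LEFT OUTER, filtered_airports BY code;
    repl_joined = JOIN flights BY origin, filtered_airports BY code USING 'replicated';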
In certain data sets what you end up having is that very few keys are overloaded: an example is the number of people who are in India; compared to the rest of the world, the number of records falling within the Asia space is going to be large, so there is a lot of skewness. When you are joining two data sets where one of them is really skewed, you can use the skewed join. And you can use the merge join if your input is already sorted; that is a pretty rare condition, but in case both data sets are already sorted on the join key you can use the merge join. I wanted to try out a complete example where we find the top 10 worst days on which flights got delayed because of bad weather in Oregon; that was supposed to be the live example, but unfortunately some of the setups are not working, so we will quickly walk through the script. You load the data, you filter the data where the weather delay is not null, just to make sure we drop the bad records early; if you have 100,000 lines, this might filter some of them out. Then you group by the day of the month, and once that is done you do a COUNT on that bag, order the result in descending order and print it. Yes, but in this case you get a lot more control over what needs to happen: in SQL you also write the filter conditions, but you do not control when the filter is executed, that is completely up to the optimizer; you put in your WHERE clauses and filter conditions, but as an end user you do not know when they will run. Here you know for sure that immediately after loading the raw data you are filtering it, and you also have control over whether it should be a replicated join or a merge join and what exactly you do with the grouped data; all of that logic is still with you, you are literally coding the algorithm you want. And nothing changes with scale: the same script can be used on 5 TB of data or 10 years' worth of data; the script does not change, only the input path varies. Probably yes, that is correct, that is the responsibility of PigStorage: initially we talked about the input splits and input formats, and PigStorage is responsible for that; it loads the data and, based on the format, creates the appropriate number of splits. Oh no, no, this is a higher-level language which sits on top of Hadoop. Hadoop is the basic infrastructure; you cannot write a MapReduce program for each and every use case, that is going to be complex. Say I have to change the filter condition here: in Pig you can go and change it directly, but with a raw MapReduce program you have to change your Java code, recompile it and deploy it back. With Pig you write the script, and internally the Pig engine parses the script, compiles it, creates a MapReduce program for you on your behalf and submits it to the cluster.
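For reference, a hedged sketch of that weather-delay example; the column positions I project are assumptions about the sample file's layout, not taken from the slide.

    flights   = LOAD '2008-sample.csv' USING PigStorage(',');
    -- project only what the question needs: $2 as day of month, $25 as the weather delay (positions assumed)
    projected = FOREACH flights GENERATE (int)$2 AS day_of_month, (int)$25 AS weather_delay;
    -- filter the bad records as early as possible
    delayed   = FILTER projected BY weather_delay IS NOT NULL;
    -- group per day, count the delayed flights, order descending and keep the worst ten
    by_day    = GROUP delayed BY day_of_month;
    counts    = FOREACH by_day GENERATE group AS day_of_month, COUNT(delayed) AS num_delays;
    ranked    = ORDER counts BY num_delays DESC;
    worst     = LIMIT ranked 10;
    DUMP worst;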
Whenever you submit a job like this, Hadoop itself does not understand Pig; the Pig engine is responsible for generating the code required for submitting the job, so the script gets converted, compiled, optimized and then submitted. Hadoop still remains the same; it does not know anything about Pig. Pig is a client-side library: as an end user I do not want to write MapReduce, and this is a much easier way to do it. If you have to change the join logic, say instead of grouping by one thing you want to group by something else, that is really cumbersome to change in MapReduce code. And the input need not be pure text either: the example we have taken loads a sample CSV, but the input format takes care of it. If you have a different kind of file, a PDF file for instance, the storage function should be able to understand the internal format; as long as it can read it, you are fine. That is the reason we have various storage functions: PigStorage, which understands text and most common cases, AvroStorage, HBaseStorage, HCatalog storage and so on, and the user is allowed to implement whatever storage format he wants to support. Just an example, it may not be a movie, I have not seen anyone processing movies on Hadoop, but you will have some reader; the classic example is separate PDF files: the New York Times wanted to scan all their old records and convert them into PDFs, and that is not a single file but multiple files, which is why I showed in the earlier example that it need not be a single file; based on the storage, you can say, process all the files located in this particular directory. Yes, provided you have HBase storage for that; it will not be using PigStorage, it will be using HBaseStorage. HBase is a classic example of NoSQL storage. Now say you want to load Cassandra data: as long as you write a Cassandra storage function which has the intelligence to communicate with Cassandra and convert the data into the format understood by Pig, it should be able to load the data. The only thing is that with Cassandra you are going off the data nodes, because Cassandra is installed in a separate set of clusters; with PigStorage or HBaseStorage the data is mostly read from the local HDFS itself, while with Cassandra it sits in a separate cluster, so you are using the compute of the existing Hadoop cluster but the data is stored elsewhere. I am not sure whether there is a MongoDB storage readily available for Pig, but nothing stops you from reading data from Mongo: the load and store interfaces are exposed in Pig, and it is totally up to the end user to use them the way he wants. I have not personally seen anyone using MongoDB as a storage here, but nothing stops you from writing the implementation, and third parties may well have implemented it.
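A hedged sketch of how the different storage functions get used from a script; the table, column family, jar path and file names are placeholders, the AvroStorage shown is the piggybank one, and in practice it also needs Avro's own dependency jars registered.

    -- plain text through the default PigStorage
    text_data  = LOAD 'data.csv' USING PigStorage(',');
    -- HBase through the loader that ships with Pig
    hbase_data = LOAD 'hbase://my_table'
                 USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');
    -- Avro through the piggybank contrib jar
    REGISTER piggybank.jar;
    avro_data  = LOAD 'data.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();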
So let us look at some debugging strategies. DESCRIBE is your friend: say you are trying to debug something but you are not too sure about the relation behind an alias; always use DESCRIBE to understand it, and use it very often during the development cycle. If the relation structure or the schema is really complex, set the pretty-print variable, very similar to pretty printing elsewhere: instead of reading the one-line output of DESCRIBE, you can set pig.pretty.print.schema to true, and if you describe again you get a much more JSON-like format, which is more readable than the earlier one; you can set it right in the console or CLI. ILLUSTRATE is another one: say you have written the code but want to quickly walk through how it is going to execute; in that case you run the ILLUSTRATE command. As a quick example, for the above script all it does is load a small sample of the flight data, load the airports, show the filtered airport set it produced and then the downstream processing, so if you want to very quickly see how the script will behave, use ILLUSTRATE. EXPLAIN is for when you want to find out how Pig is optimizing the script and what the actual logical and physical plans are for executing it; that is where you use EXPLAIN on the alias. The text output is a little difficult to follow, because not many people are comfortable reading through the plan printed there, in which case you can convert it into a PDF as well: you can run EXPLAIN with the dot option on a specific alias, which creates a dot file, and dot files are understood by tools like Graphviz, so you install Graphviz, feed it the dot file and convert it into a PDF. Now it is much easier to understand, because the same plan that was generated earlier on the command line becomes a user-friendly diagram: it loads the two data sets, joins them together, and you can see which part is the map phase and which is the reduce phase. This comes back to the question you asked: if you have to deploy the same script to production, you might want to change the variables; you do not want to hard-code most of the values in the script, so you parameterize it. That is where parameter substitution comes into the picture: you replace the values with dollar variables, and at the time of invoking the script you say -param input equal to some value, and that value is substituted at runtime before the script executes. There are other ways to do parameter substitution too: you can put all the key-value pairs you want to substitute into a parameter file and execute with -param_file, specifying that file.
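A hedged sketch of those debugging aids from the Grunt shell; the alias is a placeholder, and the pretty-print property name and the exact EXPLAIN options are written the way they were described in the talk, so verify them against your Pig version.

    DESCRIBE joined;                        -- print the schema behind an alias
    SET pig.pretty.print.schema true;       -- JSON-like, indented schema output from DESCRIBE
    ILLUSTRATE joined;                      -- walk a small sample of records through the whole script
    EXPLAIN joined;                         -- logical, physical and MapReduce plans on the console
    EXPLAIN -dot -out plan.dot joined;      -- same plan as a dot file, viewable with Graphviz

For parameter substitution, the script would refer to something like '$input' in its LOAD statement and be invoked with pig -x local -param input=2008-sample.csv myscript.pig, or with -param_file pointing at a file of key=value pairs.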
Pig also supports macros. You do not want to write a thousand lines of Pig code that is not readable by other people; it is quite possible that you want to hand over some of this code or make it reusable so that others can start using it too, and in that case a macro is really beneficial. You can write a DEFINE for a function and put most of the logic inside it, and it can be invoked later in your Pig script; in this case we create a function which filters and orders a specific data set and then invoke it below. Pig also supports user-defined functions. There are lots of user-defined functions that come with Pig itself, as part of Piggybank and as part of the normal distribution, but there are cases where you want to write your own, classic examples being statistical computation or something related to your grouping, and Pig allows you to write those. Load and store functions we already talked about, for instance a MongoDB or Cassandra storage; people can write their own load and store functions. The two other important kinds of user-defined functions are evaluation functions and filter functions. An evaluation function is where you feed in a value and want to evaluate that expression into something else, to transform the data; that is where an eval function is useful. Filter functions are used when you want to filter the data set based on some complex condition that is not expressible directly in your Pig script; that is where you add the filter condition as a UDF. All these UDFs can be written in Java and in other languages too: earlier it was possible only in Java, but later versions of Pig, starting from 0.7 or 0.8, I am not too sure on that, started supporting lots of other languages as well, so if you are a Ruby or Python programmer you can write your simple user-defined function and plug it into Pig for the processing. Yes, a JavaScript engine is also supported, I believe, but I am not 100% sure on that. So how do you write a UDF? These are the base classes: you have an eval function class as well as a filter function class. If you have to write an eval function you extend that class and implement a specific method, in this case exec over a Tuple; that is the main method which needs to be implemented, and you put your logic there. For a filter function there is a corresponding class to extend. There are lots of other methods which we will not cover in this workshop. In terms of writing your first UDF, you set up your Eclipse project, add pig.jar as a library on the classpath, and you should be good to go. So let us start with a basic example: say you want to write a lower function; you get a string that needs to be converted to lower case, the most simplistic example. All we are doing here is extending the eval function class.
You pass in the type of the value that needs to be returned back to the Pig script, in this case a String, and you override the method called exec, which takes a Tuple; those two should match, that is, the return type of the method and whatever you declared as the type parameter. All we do is take the tuple, and since a tuple is a collection of fields, we extract the first field out of it using input.get(0), because the index starts at 0; from the tuple we got the field that needs to be processed, and all we do is lower-case it and return it, a fairly simple one. There are lots of other methods which can be used as well; for instance, if you want to understand the environment this particular UDF is going to run in, there is the UDFContext, but all of that is not covered here. So how do we invoke the lower function? Again you have the data, and to transform it you iterate through each record with FOREACH and invoke the lower function by its fully qualified package name. A filter function is fairly similar: instead of extending the eval function class we extend the filter function class, and it always returns a Boolean. You get a tuple, extract the fields out of it, evaluate your condition, and based on that you return true or false. And this is how you invoke the filter function: you load the data and apply a custom filter based on this particular UDF; you pass in the unique carrier and figure out whether it starts with the letter W, and if it does the function returns true, and based on that the records are filtered. Apart from that, there are lots of analytics UDFs supported in Pig out of the box, a couple of them being rank, rollup and cube, things which are traditionally available in most databases. There is also a bunch of other UDFs shipped under the name Piggybank: whenever you extract the Pig tar file you will have a contrib folder, and in that contrib folder you will have a piggybank.jar which contains lots of these user-defined functions. You can also use UDFs written by other companies; for instance, LinkedIn has open sourced most of their user-defined functions under the name DataFu, so you can download that, add it to the classpath on the client side and start working with it; some of the UDFs given there are for sessionization, for computing statistics, for various kinds of joins and so on.
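A hedged sketch of the macro definition and of wiring a UDF jar into a script; the macro body, the jar name and the Java package names are all hypothetical.

    -- a reusable macro: filter out nulls on a field and order by it
    DEFINE filter_and_order(rel, field) RETURNS out {
        cleaned = FILTER $rel BY $field IS NOT NULL;
        $out    = ORDER cleaned BY $field DESC;
    };
    result = filter_and_order(raw_data, weather_delay);
    -- register a jar of user defined functions and call them by their fully qualified names
    REGISTER myudfs.jar;
    lowered    = FOREACH raw_data GENERATE com.example.pig.Lower(carrier);
    w_carriers = FILTER raw_data BY com.example.pig.StartsWithW(carrier);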
So let us look at some of the best practices; these are not all of them, just a few, at a very high level. At the job level, always try to enable intermediate compression. What I mean by temp compression here is that whenever Pig runs a script, the script can get translated into multiple MapReduce jobs; a large script might turn into five different MapReduce jobs, executed one after another, and every MapReduce job generates its own output. This parameter simply asks: do you want to compress the intermediate output generated by each MapReduce job? You set the temp file compression property to true and you can specify different codecs; it supports gzip, it supports LZO and so on, I guess LZO has some issues, I am not too sure on that, but various compression formats are possible. The advantage is that the amount of data that needs to be read by the second MapReduce job becomes minimal: if the first job generated 1 GB of data and by compressing it you bring it down to, say, 500 MB, only that 500 MB has to be read by the second job, so the data transferred between jobs is much smaller. On the map side, try to project as few fields as possible: with Omniture or clickstream processing you might have a file with 200 or 300 fields but need only 10 of them, so project as early as possible; that reduces the data propagated to the next stages. And filter early in the game: say you are loading 100,000 lines but all you need is 100 of them; filter them early so you do not hand them to the next stage, and that saves a lot of processing time. Also choose the right data type. What do I mean by that? Just as an example, if you express 10,000 as a string it consumes a different amount of space than an integer, so use the appropriate type where you already know it: if something can be expressed as an int, do not use a double or a chararray. It is not mandatory, but it helps. On the shuffle side, that is, transferring intermediate data from the mapper side to the reducer side, always enable intermediate output compression. This is again a MapReduce-side setting, but if you are writing a Pig script and want to optimize it, it is one of the things to consider: the data transferred from the map side to the reduce side always gets compressed, and you can choose your own codec as well.
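A hedged sketch of those settings from the top of a script; the exact property names and accepted codec values differ a little between Pig and Hadoop versions, so verify them before relying on this.

    -- compress the intermediate output written between chained MapReduce jobs
    SET pig.tmpfilecompression true;
    SET pig.tmpfilecompression.codec gzip;
    -- compress the map output that gets shuffled to the reducers (Hadoop 1.x property name)
    SET mapred.compress.map.output true;
    -- ask for a specific number of reducers on an expensive operation (the PARALLEL keyword, discussed next)
    big_join = JOIN flights BY origin, airports BY code PARALLEL 100;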
On the reducer side you can also set the parallelism. We talked about grouping and joining, but we never specified the number of reducers to be launched for those jobs; we were letting Pig take care of that for us. If you want control over it, you can specify it using the PARALLEL keyword: if you know you are joining one data set with another really big one and you need at least 100 reducers for the operation, you can specify PARALLEL 100, and Pig instructs MapReduce that this particular join requires 100 reducers. And choose the right join strategy; this typically matters once your Pig scripts reach the production environment. In some cases people might report that the jobs are running too slowly, and at that point make sure you are choosing the right join strategy, whether it is replicated, skewed, a normal join or a merge join. That is roughly the set of things I had; sorry we could not cover the full hands-on session, but that is definitely possible another time. Yes, that is one of the first use cases we discussed: a traditional data warehouse tries to bring in data from all your products, from different sources, and you use Informatica for some of the ETL-related work and then push it somewhere else. Informatica is really expensive, and it is specialized for certain things, but now they have started supporting Hadoop as well. Yes, they support Hadoop now: you can instruct Informatica to take all this data and, instead of processing it locally, put it onto Hadoop. You do a push-down, exactly: it can do push-down as well as processing in the local environment. In the case of push-down, Informatica recognizes that all the data is in Hadoop; as of now I believe only Hive, plus normal HDFS operations, is opened up in Informatica, so what you do is load all the data into Hive and use the push-down approach, where you say, this is my Informatica workflow, but when it is executed it runs on the Hadoop side; it mines all the data and brings out only the relevant information. I would not recommend MySQL for that, the reason being that MySQL is very limited, you can gather only a certain amount of information, whereas Hadoop is meant for really scalable systems; you can process huge amounts of structured as well as unstructured or semi-structured data. So you can bring in all your Twitter feeds and so on, put them onto Hadoop and use them. Or Kafka, or various other real-time ingestion and processing engines: Flume, F-L-U-M-E, is for data ingestion; Kafka is one, Storm is another. Storm can process the data feeds from Twitter; it has the concept of spouts and bolts, and one of its outputs can be fed onto Hadoop itself.
So the output of Storm gets added to Hadoop, you do all the heavy lifting on Hadoop, and then you take the meaningful insight, the cleansed data and so on, put it onto MySQL or Oracle or wherever, and put a business objects layer on top of it. Oh okay, so in that case you would probably end up using MySQL with Talend or some other open source tool. Yeah, Cascading is again built on top of MapReduce, with Scalding and things like that on Scala; it has its own set of ways of expressing things, but Pig is a data flow language, procedural as opposed to declarative. In the case of Hive we cannot work without a schema: whenever you are trying to load data you definitely need to specify the table into which you are going to load it, whereas here the tuple is the prevalent concept, which is more suited for raw processing. You again express your processing steps in Cascading, but I am not sure whether it does the optimizations and so on; Pig can do rule-based optimization on your behalf, for instance if you wrote the filter somewhere down the line it can push up the filter and the projection; I am not sure about those aspects for Cascading. So anyone can become a Pig developer: as long as you understand some of the basic constructs you can start writing the code, and there is also a certification called Pig certified developer or so. Yeah, once you get comfortable with this you will probably not end up writing complex MapReduce and things like that; it is just an abstraction on top of it, and one of the predominantly used data flow languages. The entire Hadoop stack is written in Java, so Pig is also written in Java, and it does not restrict you from using Java for writing your user-defined functions, but you do not need to know Java to write Pig scripts at all; that is the abstraction it provides. For writing MapReduce programs there is also a concept called Hadoop streaming: say you are not a Java programmer but you want to write your maps and reduces, like the initial example we saw with cat, sort, uniq and so on; you can write a shell script which acts as your mapper and another shell script or a Python script which acts as your reducer, plug them into MapReduce, and it will still work. But predominantly, to use MapReduce one level deeper you need to understand Java, whereas in the case of Pig you do not need that. And yes, as long as you are able to express the map function and the reduce function using Python scripts you can do that, but then you are mixing Hadoop streaming and Pig; those two are different things. You end up writing regexes, maps and all of that; it is useful if you have a predefined lookup table sort of thing, in this case it was a state-to-city mapping or something like that. But take the example of Apache logs: if you want to process those, you end up writing the appropriate regex so that you can transform the data into a meaningful state, because you want to throw away all the unrelated fields.
Cleanse them and then start processing, and as part of your grouping you might need to convert fields into a tuple: instead of saying group by field 1 and field 10, you might just say group by this particular tuple, because you have already defined that structure; that is possible. There is also something called the Pig stats listener, which is not covered here because this was just an intro session. If you want to figure out how your job is progressing, you can implement that listener interface, so as and when the job gets started or stopped, or if you want to figure out how many records got processed at the end of it, how long the job took, how many maps and reduces it used, all that information is available, but you will have to implement that interface or extend the relevant class. For debugging you may not need it, but it will definitely be useful in a production environment: say your job has been running for six months, and today it ran for two hours when earlier it ran for only ten minutes; then you want to find out what is really happening, and in that case you might have a listener with which you can get some meaningful insight. It is not mandatory, since you can also figure things out from the job counters, but if you have to dig down further you can implement the interface. A lot of things are happening: YARN has pretty much opened up a lot of possibilities. Earlier it was only batch processing, very much limited to Hadoop MapReduce alone, but YARN has opened it up so that you can have real-time processing, batch processing and data exploration all in a single environment; YARN is pretty much becoming the operating system for data processing. With YARN as the infrastructure you throw as many machines at it as you want and it stays scalable; from that you carve out as many nodes as you need, and you can plug in all the engines, MapReduce, Storm, Kafka or some other real-time processing engine, Pig or Hive; all of those become pluggable components on top of this data operating system. You can feed it in, but you might need some adapters, and it depends on whether you are using V1 or V2. And if you are not too concerned about latency, and you expect your data size to keep growing, it makes sense to do most of the parsing, most of the heavy lifting, on Hadoop. On the other hand, I have seen friends who use Hadoop just for parsing one GB of data on 10 nodes where the data set will never grow; for such cases you are probably fine with a database itself. Hadoop is not for replacing any existing technology; it is a complementary technology.
So it depends, on a case-by-case basis. Sure, yeah, I will send the slides across to you. Thanks a lot, folks. [Audience feedback, partly inaudible.] Yes, it was a really very good session. Can you say a few words about the Fifth Elephant? I am not sure what that is. Okay, so let me tell you: the Fifth Elephant is one of our events for 2014; it will be on July 23rd to 26th, a four-day event with a two-day workshop and a two-day conference, the workshop at one venue and the conference at another. Here is the site URL, and we also gave you the flyers, so you can see more information there, or go to the Fifth Elephant site and ask us via the info email address for further details. Secondly, I just want to know how the communication from our side was, whether everything was clear. How many of you got my mail with your names? Because I asked, and they did not check the names. Lastly, I just want to say thanks to ABG, who hosted this event, and in ten minutes you will get hot pizza as well, via ABG, so please wait for five or ten minutes. I can understand it is hard to come on a weekend, but you came here, so thanks a lot, and please do check your names on the way out.