Hello, welcome to a session on Introduction to Apache Pig. Apache Pig is a platform for data analysis. This is Dr. Neeta Poojar, professor in the Computer Science and Engineering department of Walchand Institute of Technology, Solapur. There are some prerequisites: learners should have a good understanding of Hadoop, HDFS commands and SQL. At the end of the session, students will be familiar with what Apache Pig is, the features of Pig, its anatomy and the Pig philosophy, Pig ETL processing, and the advantages and limitations of Pig.

Now, what is Pig? Pig is a high-level programming platform which is useful for analyzing large datasets, that is, datasets in the range of terabytes and petabytes. Pig is an alternative to MapReduce programming. It is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. Apache Pig works on top of Hadoop: it is built on top of Hadoop and is an abstraction over MapReduce. Pig is designed to work upon any kind of data. It provides a very good scripting language called Pig Latin for data analysis. Programmers write scripts in the Pig Latin language, which are internally converted into MapReduce tasks. Apache Pig has a very good component called the Pig Engine that accepts the Pig Latin scripts as input and converts these scripts into MapReduce jobs.

Now, why do we need Apache Pig? Pig Latin helps programmers perform MapReduce tasks easily without having to write complex programs or code in Java. Some programmers are not comfortable with Java at all, and especially when they want to write MapReduce tasks for Hadoop, they find it very difficult. Writing a MapReduce task in Java takes nearly 200 lines of code, but doing the same task in Apache Pig hardly needs 10 lines of code. So, for such programmers, Apache Pig is a boon. It uses a multi-query approach, thereby reducing the length of the code.
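To illustrate the brevity claim above, here is a sketch of the classic word count, which takes a full MapReduce program in Java but only a handful of Pig Latin statements. The file paths and field names here are my own illustrative assumptions, not from the lecture:

```pig
-- Load raw lines from an HDFS file (path is illustrative)
lines = LOAD '/data/input.txt' AS (line:chararray);
-- Split each line into words, producing one record per word
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together
grouped = GROUP words BY word;
-- Count the occurrences in each group
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
-- Write the results back to HDFS
STORE counts INTO '/data/wordcount_output';
```

Each statement defines a relation that feeds the next one, which is exactly the data flow style the session describes.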
Pig is oriented around batch processing of data. If you need to process gigabytes or terabytes of data, Pig is a very good choice, but it expects to read all the records of a file and write all of its output sequentially. Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. These are all traditional operations, but in addition to this, it also provides nested data types like tuples, bags and maps that are missing from plain MapReduce in Java.

Now, let's see some of the features of Pig. First is extensibility: using the existing operators, users can develop their own functions to read, process and write data; that is, they can write their own user-defined functions. Next, UDFs: Pig provides a facility to create user-defined functions in other programming languages, such as Java, and embed them in Pig scripts. Pig handles all kinds of data; that is, it can analyze data in both structured and unstructured form. It can even analyze semi-structured data, key-value stores and databases, and it stores the results in HDFS. Pig provides a rich set of operators; that is, it provides many operators to perform operations like join, sort, filter, etc. Ease of programming: Pig Latin is very similar to SQL, so it is very easy to write a Pig script provided you are good at SQL. Optimization opportunities: tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language; they need not bother about optimizing the execution.

Now, let us see the anatomy of Pig. The main components of Pig are as follows. First is the data flow language: Pig Latin is called a data flow language because it allows the users to describe how input is read from multiple sources, how the data is processed, and how the results are stored onto multiple machines, that is, multiple nodes. That is why it is known as a data flow language.
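The built-in operators mentioned above (joins, filters, ordering) can be combined in a few lines. The following is a hedged sketch; the file paths, relation names, field names and the passing-score rule are all illustrative assumptions:

```pig
-- Two illustrative datasets: students and their subject marks
students = LOAD '/data/students.csv' USING PigStorage(',')
           AS (roll:int, name:chararray, dept:chararray);
marks    = LOAD '/data/marks.csv' USING PigStorage(',')
           AS (roll:int, subject:chararray, score:int);
-- FILTER: keep only passing scores (threshold is an assumption)
passed   = FILTER marks BY score >= 40;
-- JOIN: combine the two relations on the roll number
joined   = JOIN students BY roll, passed BY roll;
-- ORDER: sort the joined result by score, highest first
-- (the :: prefix disambiguates fields after a join)
ranked   = ORDER joined BY passed::score DESC;
DUMP ranked;
```

Writing the same join, filter and sort directly as MapReduce jobs in Java would take far more code, which is the point the session is making.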
Second, it provides an interactive shell where you can type Pig Latin statements; that shell is called the Grunt shell. Third is the Pig interpreter and execution engine, which interprets the Pig Latin scripts and executes them accordingly. So, you can see that in the first box a Pig Latin script has been written: A loads the student records with roll number, name and grade point average; then we filter A by grade point average, that is, we choose only those records where the students have a grade point average greater than 4.0; then, for each remaining record, we generate an upper-case version of the name; and we store the result into a file called MyReport.

Now the Pig interpreter processes and parses Pig Latin scripts, checks data types, performs optimization, creates MapReduce jobs, submits these jobs to Hadoop and monitors their progress. Hadoop takes all the MapReduce jobs, distributes them onto multiple nodes and executes them in parallel.

Now, pause the video, think about this question and try to answer it. Pig should be used in which of the following cases? (A) when there is a time constraint, (B) when your data is completely in unstructured form, (C) when you want to get analytical insights through sampling, or (D) when you want to process various data sources. Obviously, option A cannot be chosen here, because Pig Latin scripts execute slowly: the MapReduce jobs generated from Pig Latin scripts execute slowly compared to MapReduce tasks written directly in Java. So when there is a time constraint, I cannot use Pig. For option B, when your data is completely in unstructured form: no, that alone is not the deciding factor, because Pig can work on any type of data, whether the data is in unstructured, structured or nested form. For option C, when you want to get analytical insights through sampling, that is, data analysis, Pig can obviously be used.
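The script described verbally above might look roughly like the following Pig Latin sketch; the exact input name, delimiter and schema are assumptions reconstructed from the spoken description:

```pig
-- Load student records (input name and schema are assumptions)
A = LOAD 'student' USING PigStorage(',')
    AS (rollno:int, name:chararray, gpa:float);
-- Keep only students with a grade point average above 4.0
B = FILTER A BY gpa > 4.0;
-- For each remaining record, emit the name in upper case
C = FOREACH B GENERATE UPPER(name);
-- Write the report out
STORE C INTO 'MyReport';
```

This is the script the Pig interpreter would parse, type-check, optimize and translate into MapReduce jobs, as the slide describes.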
And Pig is called a data flow language, that is, Pig Latin is called a data flow language, because it processes data from various data sources in parallel and then stores the output onto multiple nodes in parallel. So options C and D are definitely cases where Pig should be used.

Now let's see the Pig philosophy: pigs eat anything, pigs live anywhere, pigs are domestic animals, and pigs fly. All four of these statements have a particular meaning associated with Pig. Pigs eat anything: Pig can process different kinds of data, such as structured and unstructured data, as we have already discussed. Pigs live anywhere: Pig not only processes files in HDFS, it also processes files from other sources, such as files in the local file system. Pigs are domestic animals: Pig allows you to develop user-defined functions in any programming language and embed them in Pig Latin scripts, and these can be used to handle complex operations. Pigs fly: Pig processes data very quickly.

Now let's see what ETL processing in Pig is. These are the use cases where Pig runs jobs on clusters. It collects data from multiple sources such as ERP systems, accounting systems and flat files. Then it validates the data; then it fixes the errors in that data, that is, it removes all the noise from it; then it removes all the duplicates; then it encodes the values; and finally the clean data, which is ready to be analyzed, is stored in the data warehouse.

Now let's discuss some of the advantages of Apache Pig. Apache Pig takes less development time, because MapReduce tasks written as Pig Latin scripts are very small, hardly some 10 to 15 lines of code, which is equivalent to about 200 lines of code if the MapReduce task is written in Java. So development time is obviously much less for MapReduce jobs written as Pig Latin scripts. It is easy to learn because it is very similar to SQL.
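The validate, clean and de-duplicate steps of the ETL flow above can be sketched in Pig Latin as follows; the input path, schema and validity rules are illustrative assumptions:

```pig
-- Load raw records collected from multiple sources (path is illustrative)
raw = LOAD '/etl/raw_records.csv' USING PigStorage(',')
      AS (id:int, name:chararray, amount:double);
-- Validate: drop records with a missing key or a nonsensical value
valid = FILTER raw BY id IS NOT NULL AND amount >= 0.0;
-- Remove duplicate records
deduped = DISTINCT valid;
-- Store the cleaned data, ready to be loaded into the warehouse
STORE deduped INTO '/etl/clean_records';
```

Each stage of the pipeline maps directly onto one Pig Latin statement, which is why Pig is a natural fit for this kind of ETL work.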
Anyone who is familiar with SQL can learn Apache Pig very easily and very fast. It is a procedural language and a data flow language, because it allows users to read data simultaneously from multiple sources, process it simultaneously, and simultaneously generate results and store them onto multiple nodes. It is easy to control execution in Apache Pig. It allows user-defined functions to be written in any programming language and embedded in Pig Latin scripts. It uses a lazy evaluation technique. It uses the features of Hadoop for processing the data. Of course, it is very effective for unstructured data, and it is basically a pipeline system.

Now let's see some of the limitations of Apache Pig. Errors are not easily interpreted and solved. Apache Pig is not mature enough. It doesn't support many other platforms. It has an implicit data schema, and of course there is a delay in execution, because MapReduce jobs written as Pig Latin scripts execute slowly compared to MapReduce jobs written directly in Java.

Now, these are some of the references. Thank you.