Hello. Welcome to a session on Pig Latin. Pig Latin is a scripting language used to analyze huge data sets in Hadoop using Apache Pig. This is Dr. Anita Poojar, Professor in the Computer Science and Engineering Department at Walchand Institute of Technology, Solapur. At the end of this session, students will be familiar with the Pig Latin basics and the various operators used in Pig Latin.

The data model used in Pig Latin is fully nested. A relation is the outermost structure of the Pig Latin data model, and it is a bag, where a bag is a collection of tuples, a tuple is an ordered set of fields, and a field is a piece of data.

Pig Latin statements are the basic constructs. These statements work with relations, and they include expressions and schemas. Every Pig Latin statement is an operator, and Pig Latin statements are ordered as follows. They generally start with a load statement that reads data from the file system; the file may reside either in the local file system or in HDFS. This is followed by a series of statements that perform transformations or processing on the data, and finally it ends with a dump or store operator. The dump operator is used to display the result on the console or screen, and the store operator is used to save the result to the file system.

Now we see how Pig Latin executes. Pig runs in three ways: interactive mode, batch mode, or embedded mode. In interactive mode, Pig Latin statements are executed using the Grunt shell. The Grunt shell is invoked in either local mode or MapReduce mode; we execute Pig Latin statements on the Grunt shell and then exit the Grunt shell. In batch mode, Pig Latin statements are written in a file, and the file is saved with the extension .pig. We then execute the Pig Latin script in either local mode or MapReduce mode.

Let's assume we have a Pig script file named sample_script.pig, and the Pig script written into it is as follows. It loads student.txt, which resides in the pig_data directory, into a relation named student, using the function PigStorage with a comma delimiter, and it imposes the schema id of type int, name of type chararray, and city of type chararray; then it dumps the relation student. Now we want to execute this script file. We can execute it on the Linux shell in local mode, that is, at the prompt we run the command pig -x local sample_script.pig. We can also execute the same script file in MapReduce mode, that is, pig -x mapreduce sample_script.pig. We can also execute this script file on the Grunt shell using the exec command, that is, grunt> exec sample_script.pig. A sketch of this script and the commands to run it appears at the end of this section.

The last mode in which Pig can be executed is the embedded mode. Here, user defined functions can be written in a programming language such as Java and embedded into the Pig script.

Now let's pause the video for a while and try to answer this question: which among the following is a way of executing a Pig script? The options are (a) embedded script, (b) Grunt shell, (c) script file, (d) all of the above. If you think about this question, a Pig script can be executed in all three of these ways: as an embedded script, on the Grunt shell using the exec command, or as a script file executed on the Linux shell using the pig command. So options (a), (b), and (c) all apply here, and we choose option (d), all of the above.

Now let's see the data types in Pig Latin, starting with the simple data types: int, long, float, double, chararray, bytearray, datetime, and boolean.
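Referring back to the batch-mode example above, here is a minimal sketch of what sample_script.pig and the commands to run it might look like; the pig_data directory and the layout of student.txt are assumptions made for illustration only.

-- sample_script.pig (illustrative sketch)
student = LOAD 'pig_data/student.txt'
    USING PigStorage(',')
    AS (id:int, name:chararray, city:chararray);
DUMP student;

# run the script from the Linux shell in local mode
pig -x local sample_script.pig
# run the same script in MapReduce mode
pig -x mapreduce sample_script.pig
# or run it from within the Grunt shell
grunt> exec sample_script.pig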
Now we move to the complex data types. There are three complex data types in Pig Latin: tuple, which is an ordered set of fields; bag, which is a collection of tuples; and map, which is a set of key-value pairs.

Now we move to the relational operators. As we know, all Pig Latin statements contain operators, so first we look at the relational operators, which fall into several categories. The first category is loading and storing, which consists of two operators, load and store. Load is used to load data from a file, residing either in the local file system or on HDFS, into a relation in Pig. Store is used to store a relation to the file system, that is, either the local file system or HDFS.

Next we move to the category of filtering operators, which consists of four operators: filter, distinct, foreach...generate, and stream. Filter is used to remove unwanted rows from a relation, distinct is used to remove duplicate rows from a relation, foreach...generate is used to generate data transformations based on columns of data, and stream is used to transform a relation using an external program.

We move to the next category of relational operators, grouping and joining, which consists of four operators: join, which performs a join between two or more relations; cogroup, which groups the data in two or more relations; group, which groups the data in a single relation; and cross, which creates the cross product of two or more relations.

The next category of relational operators is sorting, which consists of two operators, order and limit. Order is used to arrange a relation in sorted order based on one or more fields, either ascending or descending. Limit is used to get a limited number of tuples from a relation.

The next category is combining and splitting, which consists of two operators, union and split. Union is used to combine two or more relations into a single relation, and split is used to split a single relation into two or more relations based on a condition.

The next category of relational operators is the diagnostic operators, of which there are four. Dump prints the contents of a relation on the console or screen. Describe is used to describe the schema of a relation. Explain is used to view the logical, physical, or MapReduce execution plans used to compute a relation. Illustrate is used to view the step-by-step execution of a series of statements.

Now let's start with the first category of relational operators, load and store. They are also called the read and write operators of Pig. We start with the load operator, the operator for reading data. We can load data into Apache Pig from the file system, either local or HDFS, using the load operator of Pig Latin; the syntax is shown here. Now let's consider an example: load the data in the student_data.txt file into Pig under the schema named student using the load command. The command is shown here. We are loading the contents of student_data.txt using the PigStorage function with a comma as the delimiter, imposing the schema id of type int, firstname of type chararray, lastname of type chararray, phone of type chararray, and city of type chararray, and we are loading all this data into a relation called student. A sketch of this load statement is shown below.
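As a concrete illustration, here is a minimal sketch of the load statement just described; the HDFS path and port below are assumptions made for illustration.

student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);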
Now, storing data. The store operator is also called the write operator of Pig Latin. You can store the loaded data in the file system using the store operator; the syntax is shown here. Now let's go to the example: store the relation student into the directory pig_output on HDFS, using PigStorage with a comma delimiter. The output is like this.

Now we move to the filtering operators. The filter operator is used to select the required tuples from a relation based on a condition. The syntax is shown here: on the left side is relation2_name, the output relation into which the filtered data is stored, and we filter relation1_name by a certain condition. Now let's move to an example. Assume that we have a file named student_details.txt in the HDFS directory pig_data, as shown below; there are eight tuples in this file. We load the contents of this file into the relation student_details; the load statement for this is shown here. Now let us use the filter operator to get the details of the students who belong to the city Chennai. So we filter student_details by city equal to Chennai and store the result into the left-hand-side relation, filter_data. Now dump filter_data; the output consists of only two tuples, because there are only two students who belong to Chennai.

Now moving to the distinct operator. This is used to remove redundant, that is duplicate, tuples from a relation; the distinct operator works on the entire tuple. The syntax is shown here, followed by an example. Let us again consider the file student_details.txt, which consists of these tuples. You can see that there are two redundant tuples here, with ids 001 and 003; they are duplicates in student_details.txt. Since we are using the distinct operator, it will remove these redundant tuples. So we apply distinct to student_details and store the output into the left-hand-side relation, distinct_data. Now dump distinct_data; the output shows only six distinct tuples.

Now moving to the foreach...generate operator. The foreach operator is used to generate specified data transformations based on the column data. The syntax is shown here. Now let's move to the example. Again consider student_details.txt, which consists of eight tuples. We want to generate the fields id, age, and city for each tuple in student_details.txt, so we write the Pig Latin statement as follows: foreach student_details generate id, age, city, and the result is stored in the left-hand-side relation, foreach_data. The next Pig Latin statement is dump foreach_data, and the output is shown here, which consists of three fields: the id, age, and city of the students. A combined sketch of these three operators appears after the closing.

These are some of the references that were used for this session. Thank you.
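To tie the session together, here is a minimal combined sketch of the filter, distinct, and foreach...generate examples discussed above; the HDFS path, port, and the full schema of student_details.txt are assumptions made for illustration.

-- load the data (the schema is an assumption based on the fields used in the session)
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

-- keep only the students who belong to the city Chennai
filter_data = FILTER student_details BY city == 'Chennai';
DUMP filter_data;

-- remove duplicate tuples
distinct_data = DISTINCT student_details;
DUMP distinct_data;

-- project only the id, age and city fields of each tuple
foreach_data = FOREACH student_details GENERATE id, age, city;
DUMP foreach_data;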