Hello, welcome to a session on the installation of Apache Pig and its basic commands. This is Dr. A. M. Pooja, professor in the Computer Science and Engineering Department at Volchin Institute of Technology, SolarPol. At the end of this session, students will be familiar with the installation of Hadoop and Apache Pig as well as the basic commands of Pig.

There are some prerequisites for installing Apache Pig. You need the Java 8 package on your system. You also require the Hadoop 2.7.3 package, and you can install Hadoop on Linux-based operating systems, especially Ubuntu or CentOS.

Now let us see how to install Hadoop. Initially we download the Java 8 package, save it in the home directory, and extract the Java tar file using the tar command. Then we download the Hadoop 2.7.3 package and extract it with the tar command as well. Next we add the Hadoop and Java paths to the bash file, that is .bashrc, then save the file and close it. To apply these changes to the current terminal, execute the source command, that is source .bashrc. To make sure that Java and Hadoop have been properly installed on your system, run the java -version and hadoop version commands.

Now we need to edit some Hadoop configuration files. So we change to the directory holding the Hadoop configuration files, under the Hadoop home directory, and list its contents using the ls command.

Open the core-site.xml file and edit the property inside the configuration tag as shown in the snapshot below. core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings that are common to HDFS and MapReduce.

Then we edit hdfs-site.xml; we edit the property mentioned in the configuration tag as shown in the snapshot below. This file contains configuration settings of the HDFS daemons, that is the NameNode, the DataNode and the Secondary NameNode. It also includes the replication factor and block size of HDFS.

Then we edit the MapReduce configuration, that is the mapred-site.xml file; we edit the property mentioned in the configuration tag as shown in the snapshot below. This file contains configuration settings for MapReduce applications, like the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available to a process, etc. In some cases this file may not be available, so we have to create it from the mapred-site.xml.template file as shown below.

Then we move on to yarn-site.xml and edit the property inside the configuration tag as shown below. This file contains configuration settings of the ResourceManager and the NodeManager, like the application memory management size, the operations needed on programs and algorithms, etc.

Then we edit hadoop-env.sh and add the Java path as shown below. This file contains the environment variables used in the scripts that run Hadoop, such as the JAVA_HOME path.

Go to the Hadoop home directory and format the NameNode. So we change to the hadoop-2.7.3 directory and format the NameNode. This formats HDFS via the NameNode, and the command is executed only the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. Remember one thing: never format a running Hadoop file system, or else you will lose all the data stored in HDFS. Once the NameNode is formatted, go to Hadoop's sbin directory and start all the daemons. You can start all the daemons with a single command, that is ./start-all.sh. Consolidated sketches of the commands and configuration files from this part are given below.
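As a recap of the download and environment-variable steps above, here is a minimal sketch; the archive names, mirror URL, versions and install locations are assumptions and may differ on your system.

    # Download Java 8 and Hadoop 2.7.3 into the home directory and extract them
    # (archive names and the mirror URL are assumptions)
    cd ~
    tar -xvf jdk-8u101-linux-x64.tar.gz
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
    tar -xvf hadoop-2.7.3.tar.gz

    # Add the Java and Hadoop paths at the end of ~/.bashrc (paths are assumptions)
    export JAVA_HOME=$HOME/jdk1.8.0_101
    export HADOOP_HOME=$HOME/hadoop-2.7.3
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    # Apply the changes in the current terminal and verify the installation
    source ~/.bashrc
    java -version
    hadoop version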
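The snapshots of the configuration files are not reproduced in this transcript; the property values below are only a typical single-node (pseudo-distributed) example, not necessarily the exact slide contents.

    <!-- core-site.xml: tells the daemons where the NameNode runs (value is an assumption) -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: HDFS daemon settings such as the replication factor -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: MapReduce settings (create it from mapred-site.xml.template if absent) -->
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

    <!-- yarn-site.xml: ResourceManager / NodeManager settings -->
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>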
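Finally, a sketch of creating mapred-site.xml from its template, setting the Java path in hadoop-env.sh, formatting the NameNode and starting all the daemons; the paths here are again assumptions.

    # Create mapred-site.xml from its template if it does not exist
    cp mapred-site.xml.template mapred-site.xml

    # hadoop-env.sh: set the Java path used by the Hadoop scripts (path is an assumption)
    export JAVA_HOME=/home/user/jdk1.8.0_101

    # Format the NameNode (first time only -- never on a running file system)
    cd ~/hadoop-2.7.3
    bin/hadoop namenode -format

    # Start all the daemons with a single command from the sbin directory
    cd sbin
    ./start-all.sh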
The start-all.sh command above is a combination of three commands, as shown here, or else you can start the individual services as shown below. That is, we first start the NameNode, then the DataNode, then the ResourceManager. The ResourceManager is the master that arbitrates all the available cluster resources and helps in maintaining and managing the distributed applications running on the YARN system. It works together with each NodeManager and each application's ApplicationMaster. Then we start the NodeManager. The NodeManager is the per-machine framework agent responsible for managing the containers, monitoring their resource usage and reporting the same to the ResourceManager. Then we start the JobHistoryServer, which is responsible for serving all job-history-related requests from clients.

To check that all the Hadoop services are up and running, run the jps command. Now open the Mozilla browser and go to localhost to check the NameNode web interface.

After installing Hadoop, we start with the installation of Apache Pig on Linux. We download the Pig tar file and extract it using the tar command as shown here. Now edit the .bashrc file to update the environment variables of Apache Pig. We set this so that we can access Pig from any directory; we need not go to the Pig directory to execute any Pig commands. Also, if any other application is looking for Pig, it will get to know the path of Apache Pig from this file. So we edit .bashrc and add the following at the end of the file, that is the Apache Pig home path as well as the Hadoop path. Run the command source .bashrc to make the changes take effect in the same terminal.

To test whether Apache Pig is installed correctly or not, you can check the Pig version, that is, run Pig's version command. You can also use Pig's help option to see all the Pig command options. Run pig to start the grunt shell. The grunt shell is mainly used in Pig to run Pig Latin scripts.

Pig runs in two modes, that is MapReduce mode and local mode. MapReduce mode is the default mode and requires access to a Hadoop cluster and an HDFS installation. Since this is the default mode, it is not necessary to specify the -x flag; you can execute just pig, or pig -x mapreduce. The input and output in this mode are present on HDFS. In local mode, with access to a single machine, all files are installed and run using the local host and the local file system. Local mode is specified using the -x flag, so we use the command pig -x local.

Now we will see some of the commands that are run on the grunt shell. First, the shell commands: if I want to run shell commands, I use sh, for example sh ls, which lists the contents of the current directory. Then, if I want to use file system commands, I use fs, for example fs -ls, and the output is shown here. We can also use the utility commands provided by the grunt shell, such as clear, help, history, quit, set, exec, kill and run, to control Pig from the grunt shell. The clear command is used to clear the screen of the grunt shell. The help command gives you a list of Pig commands and Pig properties. The history command simply displays a list of the statements executed so far since the grunt shell was invoked. The quit command is used to quit the grunt shell. The set command is used to show or assign values to the keys used in Pig. Now let us see some of the key-value pairs. You can set the number of reducers for a MapReduce job by passing any whole number as the value of the key default_parallel. The remaining set keys are covered below, after the following sketches.
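Going back to the daemon start-up described at the beginning of this part, here is a sketch of starting the services individually; the script names are the ones shipped in Hadoop 2.7.3's sbin directory.

    # Start the HDFS, YARN and history-server daemons one by one
    ./hadoop-daemon.sh start namenode
    ./hadoop-daemon.sh start datanode
    ./yarn-daemon.sh start resourcemanager
    ./yarn-daemon.sh start nodemanager
    ./mr-jobhistory-daemon.sh start historyserver

    # Check that all the daemons are up and running
    jps

    # NameNode web interface (default port for Hadoop 2.x)
    # http://localhost:50070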
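A minimal sketch of the Pig installation steps described above; the Pig version, mirror URL and paths are assumptions.

    # Download and extract Apache Pig
    wget https://archive.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
    tar -xvf pig-0.16.0.tar.gz

    # Add at the end of ~/.bashrc so Pig can be run from any directory
    export PIG_HOME=$HOME/pig-0.16.0
    export PATH=$PATH:$PIG_HOME/bin
    export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop   # lets Pig find the Hadoop configuration

    # Apply the changes and verify the installation
    source ~/.bashrc
    pig -version
    pig -help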
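And a sketch of starting the grunt shell in its two modes and running shell, file system and utility commands from it:

    # MapReduce mode (the default) and local mode
    pig                  # same as: pig -x mapreduce
    pig -x local

    # Inside the grunt shell: shell command, file system command, utility commands
    grunt> sh ls
    grunt> fs -ls
    grunt> help
    grunt> history
    grunt> clear
    grunt> quit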
Continuing with the set command keys: similarly, you can turn the debugging feature in Pig on or off by passing on or off as the value of the debug key. You can set the job name for a job by passing a string value to the job.name key. Similarly, you can set the job priority by passing any one of the allowed values to the job.priority key. And for streaming, you can set the path from which data is not to be transmitted by passing the desired path, in the form of a string, to the stream.skippath key.

Now let us see the exec command. Using the exec command, we can execute Pig scripts from the grunt shell. For example, consider a file named student.txt in the Pig data directory of HDFS with the following contents: three records giving the student ID, the name of the student and the location. Assume that we have a script file named samplescript.pig in the Pig data directory of HDFS and its contents are as follows. Now let us execute the above script from the grunt shell using the exec command as shown below, that is exec samplescript.pig; the output is shown here. The exec command executes the script samplescript.pig; as directed in the script, it loads the student.txt file into Pig and, through the dump operator, displays the contents of the text file on the monitor.

The kill command kills a job from the grunt shell; the command used is kill followed by the ID of the job.

Next, the run command. You can run a Pig script from the grunt shell using the run command. As an example, consider the same student.txt file in the Pig data directory of HDFS, and assume that we have a script file named samplescript.pig in the local file system with the following content: it loads student.txt using PigStorage, followed by the list of fields. Now let us run the above script from the grunt shell using the run command, that is run samplescript.pig. You can see the output of the script using the dump operator as shown below. The difference between the exec and run commands is that if we use run, the statements of the script are available in the command history. A consolidated sketch of these set, exec and run commands is given after the closing below.

So that is all. These are some of the references. Thank you.
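As mentioned above, here is a consolidated sketch of the set command keys covered in this session; the values shown are only illustrative.

    grunt> set default_parallel 10          -- number of reducers for a MapReduce job
    grunt> set debug on                     -- turn debugging on or off
    grunt> set job.name 'my first pig job'  -- name attached to the submitted job
    grunt> set job.priority high            -- very_low, low, normal, high or very_high
    grunt> set stream.skippath '/data/'     -- path not to be transmitted for streaming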
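And a sketch of the student.txt example used with exec and run; the field names, the HDFS host and port, and the script and data paths are assumptions made for illustration.

    -- samplescript.pig: load student.txt with PigStorage and dump it
    student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
              USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
    DUMP student;

    -- From the grunt shell:
    grunt> exec /pig_data/samplescript.pig   -- script stored on HDFS
    grunt> run /home/user/samplescript.pig   -- script on the local file system;
                                             -- its statements appear in the history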