Hello and welcome to this brief tutorial about MapReduce, a key concept in cloud computing. For the purposes of this guide we will be using Amazon Web Services, more specifically their Elastic MapReduce service. The main technologies used are Apache Hadoop, a collection of open-source software that facilitates using a network of many computers to solve problems involving massive amounts of data and computation, and Apache Hive, a data warehouse project built on top of Apache Hadoop that provides data summarization, querying and analysis. Our goal is to parse a couple of log files, similar to this one, amounting to several thousand records. This will be done using this Hive script: an SQL table will be created with this structure, the files will be parsed based on this regular expression, and finally the query will output the total number of requests per operating system.

Amazon Elastic MapReduce reads its input data from, and writes its output data to, Amazon S3. Amazon S3 is an object storage service built to store and retrieve any amount of data from anywhere. We have created two folders inside our bucket, an input one and an output one, both empty.

Now to create the Elastic MapReduce cluster. There are several ways to do this: click the button here to create a cluster, use the advanced options for finer control, or clone an older existing cluster, steps included. Among the options when creating a cluster we have the cluster name. The logging option takes up quite a bit of storage space, so I'll leave it disabled. Launch mode is an interesting one: you have the option of the cluster automatically terminating after it has finished a step, where a step is a single unit of computation. For the software configuration we'll pick the core Hadoop option, as it also includes Hive. For the hardware configuration we have the m4.large instance type, which is the smallest general-purpose virtual machine available; for more specialized applications we could choose memory-, storage- or compute-optimized instances instead. Arguably the most important option in this menu is the number of instances. This is the number of slave nodes that actually perform the tasks the master node distributes to them: the more there are, the faster the job runs, but also the higher the price. Last but not least, under security and access I strongly recommend using an EC2 key pair. After that, go ahead and create the cluster; it will take a couple of minutes to set up.

Once the cluster is up and running, we can assign tasks to it through the Steps menu. There are different step types, from a custom Java JAR to a Hive program. The script and input locations are taken from S3, so you can access them remotely. For the purposes of this tutorial we will use the samples provided by Elastic MapReduce; however, by using these remote samples we won't be able to modify the input data or the script. So we will SSH into the master node of our cluster, download those files there, and upload them to our own S3 storage so that we can modify them and build further onto the application. Once inside, we first use the hadoop dfs command to download the Hive script. After that we can see it here. Notice that it reads its input from the location input/cloudfront/data; that is where the log files will reside.
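To make the structure concrete, here is a minimal sketch of what a Hive script of this kind could look like. To be clear, this is not the exact sample script shown in the video: the table name cloudfront_logs, the column names, the simplified regular expression, and the ${INPUT}/${OUTPUT} variables (which would be supplied as arguments when the step runs) are assumptions made for illustration.

```sql
-- Illustrative sketch only. The actual EMR sample script may use different
-- column names, a more elaborate regular expression, and different paths.
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  log_date    STRING,
  log_time    STRING,
  location    STRING,
  bytes       STRING,
  request_ip  STRING,
  method      STRING,
  host        STRING,
  uri         STRING,
  status      STRING,
  referrer    STRING,
  os          STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- Simplified pattern: skip comment lines starting with '#', capture the
  -- first ten whitespace-separated fields, then pull the operating system
  -- out of the user-agent portion of the line.
  "input.regex" = "^(?!#)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+[^\\(]*\\(([^;\\)]+).*$"
)
LOCATION '${INPUT}/cloudfront/data/';  -- matches the input/cloudfront/data folder in S3

-- Count the total number of requests per operating system and
-- write the result back to S3.
INSERT OVERWRITE DIRECTORY '${OUTPUT}/os_requests/'
SELECT os, COUNT(*) AS request_count
FROM cloudfront_logs
GROUP BY os;
```

The RegexSerDe maps each capture group in the pattern to one table column, and the final GROUP BY query produces the per-operating-system request counts described above.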
Now that we know we must match that path in our S3 storage, we go into the input folder and create two more folders: cloudfront, and inside it, data. On the master node, in order to download the logs, we make the same two directories, cloudfront and data. Here we run this command and the logs are downloaded; there are six of them in total, providing a substantial input for our example. We then upload them to our S3 storage, and do the same for the script we downloaded earlier. After that we check our bucket to confirm that the data has arrived, and here it is: the input and the script file.

We are ready to add our step. We use a Hive program, as mentioned before. The script location is inside our storage unit, the input as well, and the output should be an empty folder. The final arguments set the SQL version in the Hive configuration. Click Add and the process starts; it takes around one minute to process all six log files. After the step has completed we can check the output folder for the result of the query. Download it and we can see the numbers: how many requests came from each operating system.

To reduce the overall cost, do not forget to terminate the cluster whenever it would otherwise sit idle; there are no upfront costs when creating a new cluster, and you can always clone an old one to save some time. And now it is up to you: you can change the script or the input files for different results. Thank you for watching.