Hello everyone, and welcome to this UK Data Service webinar introducing Hadoop. I see there's still a lot of people joining, so I'm going to give it a couple more seconds until we actually start the presentation, but there's already a lot of you on, so thank you for joining. Okay, let's just start with some introductions in the meantime. I am Margarita, I work for the UK Data Service and I'm based at Jisc in Manchester. And presenting today is Peter Smyth; he's a research associate working for the UK Data Service as well, and he is based at the University of Manchester.

Just a few things before we start. You have a menu in the top right corner of your screen, which hopefully you've had time to familiarise yourself with. You can collapse and expand it using the red arrow, and you can type comments and questions into the questions box. We probably won't be able to answer them until the end of the presentation, but if you do have a comment during the presentation, please do write it to us. All attendees are muted, so we won't be able to hear you. We are recording this webinar, so if everything goes smoothly the recording will be available on our website, under News and events and then Past events. And the last thing: under Handouts, I have uploaded the slideshow in PDF format, so if you do want to download it you can do so now, and if you want to write comments on it you can do that. But the slides will be available on the past events page too.

Right, so before I hand over to Peter, I'd just like to see if all of you can hear me, so I'm going to launch a poll, just to check that you can all hear me, okay? Okay, so most of you at the moment can hear me, although there are some stragglers, but most of you have voted now. I just have a slide for people that can't hear me; I'll pull it up in a second. So it looks like 98% can hear me, which is really good. However, I've got one slide here for people that can't hear me. Of course, talking to them won't help, but there are some numbers they can dial in on, and they can check their speakers and headset, so hopefully the sound will improve then. Right, so I'm going to hand over to Peter now, and he's going to start the presentation.

Okay, thank you, Margarita, for that. Our presentation today has five sections: welcome, which Margarita's already done, so I'll welcome you as well; we're going to look at what big data is in terms of the definitions; why big data, which might be translated as why do we have big data in the first place; we're going to look at the processing of big data with Hadoop in a very simplified form; and then at the end we're going to show you some examples of using the Hadoop system using Hive, which is a component of it.

So what is big data? You'll have an idea, and you'll have heard all the definitions starting with the letter V. It started off with three Vs, volume, velocity and variety, and then people started adding a few more, so we've got seven there; you can probably find more in the dictionary. But we also need to consider where this data is coming from, so we've got various sources of big data: social media, GPS, many modern systems that generate data. And what kind of data are we talking about? Big data is quite often associated with unstructured data: free-form text, audio, video. It's stored in different formats, such as JSON and XML, and often in NoSQL-type databases.
Then we've also got structured data. Now, this is normally associated with more traditional, smaller data, but of course if a table is big enough it's going to be big data. And then finally, what can we use to process big data? There's Cassandra, there's MongoDB, there's probably many more, but the one we're going to talk about today is Hadoop, just one of many.

So despite all the words and everything I've said about big data, perhaps the easiest way to think of big data is this little definition from Wikipedia: big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Or, to put that another way, you're used to processing a bit of data on your PC in your favourite application, whether it's SPSS, Stata, R or whatever, and it gets to the point where the data is just too big to fit into your desktop application. At that point, you personally are going to consider that to be big data.

But we've already mentioned in the previous slide that it comes from a variety of sources. And one of the key things about these new sources of data is that they haven't been generated specifically for analysis. In addition, unstructured data is typically verbose in its layout and constitution. So you can also have situations where the data simply contains more than we actually need for the analysis. That follows from the fact that it hasn't been generated specifically for the analysis.

So we've had a look at structured data. This is the more traditional data, the tabular data and so on. Relational database systems like Oracle and SQL Server all rely on structured data. It consists mainly of tables: you construct a table, you put in column names, you say what types of data these columns are going to contain, and then you populate the table with rows and rows of data in the right format, in the right positions in the table. The disadvantage of this is that it makes it somewhat inflexible in dealing with changes in the structure. If you suddenly need a new column, or worse, you need to reorder columns, then you've got a bit of a problem. On the other hand, an advantage is that you can perform a lot of validation checks as you load the data, because you already know what type of data you're expecting in a certain column position.

Unstructured data, by contrast, is free-format data. There's no guarantee what data items will be included, nor the order in which they'll appear. When the data is originally received, it's just stored as is: you've got a data source coming from somewhere, you read a record in, you just store it as data, and you're going to process it later on. When the data is subsequently processed, that's when you worry about extracting the values of specific fields. Despite the name, however, there is still some structure in unstructured data.

So here are some examples of structured and unstructured data. You can see the top one is an Excel format, where you just have a worksheet and you've got your column names at the top of the table, and the rows below represent the data. The CSV and the tab-delimited files below that are very similar, and they're very compact. But they are assuming that, having given the column headers at the beginning, every row of data that follows is in the correct format and in the correct positions. Now let's look at the unstructured data examples, starting with XML.
So in the XML, we've got, within a header, something called column, and we've got a name, QN, which is a column name, and we've got the PC column and the HN column and the QA column. But all of these column names are preceded and followed by tags: a column tag before and a forward slash column tag after. And similarly on the cells further down, each cell value is preceded and, well, post-ceded, or whatever the word is, by the closing cell tag. So you can see that this format, although it's very descriptive and it clearly allows you to change things around a bit, is quite a verbose way of recording the data.

The JSON format, which is probably the most popular unstructured format out there, also comes with everything needed for a row of data contained in the row itself. That is, each row contains what in other terms would be called a column name, followed by the value for that column, and so on and so forth. So there we've got exactly the same data represented in JSON. Now, the thing to remember is that the next row would again repeat QN in quotes, followed by the value for the next row of data, and so on. So in fact every row of data has the column names repeated. Again, you can see why that becomes verbose, but it is quite a flexible format. So that explains the verbosity of the data, which can help make it big.

Let's look at another example. This is a map of a trip across Manchester. The data used to produce it was recorded on a GPS logger; you plug your GPS logger into your PC via the USB port and the software provided draws it nicely on a map like that, so you can see where you've been. Very common application, and what you're seeing on the screen is the typical use of it. But you can do other things with it as well. The data that the GPS logger records will include a timestamp, a longitude, a latitude value and various other bits of information, and it does that at regular intervals, every 15 seconds, say. Now, it uses all of that information to plot that map. What we're going to do is just use the longitude, the latitude and the timestamp, and then we're going to take a bit of a liberty: we're going to use Pythagoras, we're going to pretend we're part of the Flat Earth Society, and we're going to use that to generate some graphs of a different type.

So the first graph is a sort of speed against time graph, and you can see on this graph that at the beginning and the end the speed is very low, which represents walking, and the section in the middle, up and down, up and down, is a tram journey across Manchester. Pretty well what you'd expect. I then used the same data in the same way to generate another graph, of distance versus time. Very simple graph: it starts off slowly because I'm walking, then I'm on the tram so I've got constant speed, I get off the tram and start walking again, so it slows down. That's all very, very obvious.

The point about this is that if you compare those top two graphs with these two at the bottom, they look very similar. You can see the start and stop, you can see me walking on the left-hand side and on the right-hand side; they're almost identical shapes. But the bottom two graphs are drawn with only every tenth data point that was available to me. So in fact, by discarding 90% of the data, I've actually lost very little information at all, and that's the type of decision you may have to face when you're looking at your data: how much of it do you need, what can you get rid of, and how can you get rid of it?
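Just to spell out the liberty being taken with Pythagoras there, the calculation behind those speed and distance graphs is roughly the flat-earth approximation: distance ≈ √((Δlat)² + (Δlon)²), and speed ≈ distance ÷ Δt, where Δlat, Δlon and Δt are the differences in latitude, longitude and timestamp between one logged point and the next. This is only a sketch of the idea: strictly you would convert the coordinate differences from degrees into metres rather than treating them as plane distances, but over a short trip the shapes of the graphs come out much the same.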
So now let's look at another source of examples, and explain where the problem comes in. Consider tweets. I should just point out that the scale at the bottom of your screen is a pseudo-log type scale, just to give you an idea of volumes; there's no attempt to be precise in any of this. A single tweet, as you send it, is going to be less than 1 kilobyte. If you actually record tweets from Twitter.com using the various means of doing so, you'll find that a tweet, the same tweet if you like, is probably about 4 to 5 kilobytes in size. If, for your analysis, you needed all of the tweets from a user spanning several years, then for a prolific Twitter user you could have well over a gigabyte of data to worry about, but that's not too much trouble for your desktop application. If, however, you want to look at all of the tweets from a user and all of the tweets from all of their friends, and potentially their friends' friends as well, then it's very easy to see how this data is quickly going to get out of hand and will be well into the 10-plus gigabyte area, or more if they've got lots of friends.

Looking at this from the other point of view, let's look at some smart meter data. Here you're presented with a data set of smart meter data from whatever source; there is a source of smart meter data in the UKDS discovery system, so you can find some there. And this is going to start off as a very large file. The data records in this file are recorded every half hour, which is possibly not what you want. So if you aggregate that smart meter data by day, you can reduce the size. If you do it by month, you can reduce the size even more. And if you aggregate by month and just look at a specific geographical area, you can get it down to something very manageable, I'm sure.

The point about this is that you can put in an arbitrary line, at approximately five gigabytes in this case, and decide that anything to the left of that line you're happy to treat in your desktop application. You feel you can process it happily, so you're not going to consider that big data, so you're okay there. And anything on the other side is big data. Now, the problem is that in both these scenarios there is a situation where you're in the big data environment. And at that point, even if you're subsequently going to discard a lot of the data, you don't really have any choice but to treat it as big data, and therefore you need a big data environment to work in. And that's where Hadoop comes in.

So, Hadoop was created by Doug Cutting and Mike Cafarella in 2003 to 2004, and it's based on the Google File System. The idea behind MapReduce itself has been around far longer; the map and reduce functions first appeared in the language Lisp in the early 1960s. And just for a bit of information, the elephant that you see at the bottom of the screen, which is the icon always used to depict Hadoop, comes from a cuddly toy elephant owned by Doug Cutting's son. If you Google Doug Cutting and Hadoop elephant, you'll find lots of pictures of the real elephant, and you'll find it's a far scrawnier affair than the beefed-up one there, obviously done to represent big data.

So let's have a look at the infrastructure of Hadoop. A Hadoop cluster can consist of thousands of nodes. Each node is an individual computer, many times more powerful than the average desktop.
It's the same in that it has a processor, and it's got disk, and it's got memory, but a lot more of everything. The minimum number of nodes you'd really need to have a cluster would be four, but you can have many thousands if you want to, if you can afford them. Certainly the likes of Yahoo and Google will have thousands in their clusters. The strength of Hadoop, though, is not this raw power, despite the fact that you've got powerful computers in there, but the ability to break a processing task down and run all of the parts in parallel; effectively divide and conquer, that's what it's going to do. It's also built to be very resilient and cope with server failures, because if you think about it, although servers are very stable and don't break down very often, if you've got a cluster with thousands of them, the chances are one is going to break every day.

If you go to Hortonworks.com, or several other places, you'll find plenty of diagrams showing, in this case, what it calls the Hortonworks Data Platform; other companies have their own names, but essentially this diagram is depicting a Hadoop system with various components in it. Some of them are essential components, others are optional elements. We're going to look at this in a very cut-down way: we'll look at HDFS, we'll look at MapReduce, and at the end we're going to do a demonstration using Hive, which is SQL-based. So that's our minimal environment.

Starting with HDFS: to the end user it just looks like a file system, and later on when we do the demo I'll show you it looking just like a file system. But internally the files are segmented into blocks of 128 megabytes and randomly distributed across all of the available data nodes. Whether they all get used, you don't know; you don't really care either. A data node is a server in the Hadoop cluster where the actual processing takes place, i.e. where your programs are run. So HDFS is going to place the data onto these data nodes, which are just big computers, and your programs are going to be run from there. The name node is another server in the cluster, which is going to keep track of where all the blocks of your data are; you don't need to do that yourself, you just know your data by its file name.

Just as an aside, consider what an ordinary client-server system looks like. In a traditional client-server setup, which Hadoop is not, it's normal for the data to be stored on a server, like a file server, and on your desktop you'll have office applications such as Word and other things. When you need to edit a Word document, a copy is sent from the server to the desktop, and you edit it there. When you finish, the new version is copied back to the server. That is, the data is moved to meet the program. This makes sense because, typically, the MS Word program is many times larger than the document being edited. Now, in a Hadoop environment it's a bit different. So on the left-hand side you've got a traditional client-server system, and the data is moved to the program. On the Hadoop side, because the data is so, so much bigger than the program, it makes more sense to do it the other way around: the program is moved to where the data is, and that's exactly the way Hadoop works. That's one of the reasons why the name node needs to keep an exact record of where all your data is.

The other Hadoop component is MapReduce. The two parts are a Map part and a Reduce part. They're just names: Map has nothing to do with maps, and Reduce, although it typically involves aggregation, doesn't necessarily mean reducing something to make it smaller.
They're both effectively smallish programs, typically written in Java, but they can be written in other languages such as Python. In a couple of slides I'm going to walk you through a demonstration of MapReduce. There are a couple of things you need to know about MapReduce. The first one is that it's got to be shown, for completeness, in any basic Hadoop-type introductory webinar. The second thing you need to know is that you're probably never actually going to have to do it, because there are so many alternatives in the different Hadoop components and the different things that you can do. We're just showing this to give you an idea of what you're missing, or rather what you won't be missing.

Now, the scenario I'm going to walk you through in MapReduce is this: let's pretend you have a questionnaire of 10 questions, with yes or no answers. It's completed by a random number of households in each postcode area starting with the letter M, and you want to know, by postcode, the percentage of yes answers for each of the 10 questions, okay? So, this is what is going to happen. Storing the data in HDFS means that it is automatically split into blocks and spread across multiple data nodes; I mentioned that earlier. The mapper process is sent to each of the nodes which holds part of your data, to process the block of data on that node. The records output from the mapper process are sent to another data node, on which the reducer process will run. Again, in both cases these are just standard computers, big computers, where you're going to run a program. All records with the same key value will be sent to the same data node, which will be the one that runs the reducer process. The output records from the reducer process are written to a file in HDFS; each reducer produces its own output file.

So, diagrammatically, this is what we have, more or less. Now, I'm going to pick a record at random, like this one. This is just one of the records in your input data set. This record was moved to this machine, okay? This is the machine where this block of records is going to be processed. And the output of your mapper process on this machine simply rearranges the record so that the question number, that's QN3 in this case, is going to be treated as a key, and the rest of the record is the value. Now, from your point of view, it's just reorganising the record. The reason I have described it as key and value is that the key part is significant for the next step. Because the next step is what's called a shuffle step, and that sends the records with the same key value to the same reducer. So, this highlighted area now is the reducer. It's just another data node somewhere in the cluster, which is going to receive all of the question 3 records from your data set. And the job of the reducer, in this case, is to do an aggregation. So, for each postcode that it gets records for, it's going to sum the records up and work out the percentage of yes answers for question 3. And when it's done that, the record is sent to the output data set. And that's it. So, at the end, in the output data set, we've got the percentage-of-yes records by postcode area, okay?
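As a preview of where this is heading: at the end of the Hive demonstration you'll see that this whole questionnaire example can be expressed in a few lines of HQL. A rough sketch, assuming a table called questionnaire with columns pc, hn, qn and qa as in the demo later, and yes answers stored as the text 'yes':

    SELECT pc,
           qn,
           SUM(CASE WHEN qa = 'yes' THEN 1 ELSE 0 END) / COUNT(*) * 100 AS yes_answers
    FROM questionnaire
    GROUP BY pc, qn;

Behind the scenes, Hive turns a query like this into the map, shuffle and reduce steps for you, which is exactly why you rarely need to write MapReduce yourself.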
Moving on from that, there are alternatives to MapReduce. But, first of all, you really need to understand your data. You only need to use Hadoop and big data tools if your data really is too big; we've said that already. You then have two choices. Either you can perform the analysis inside Hadoop itself, using specialised tools, or you can use the big data tools to transform and reduce the size of your data, if that's a viable option, and then download the results to your favourite desktop application. And that second scenario is the one our demonstration is going to use. But if you do have to analyse it inside Hadoop, there are plenty of add-on products that you can use. There's Mahout, which is an add-on to Hadoop. There's Spark, which works with Hadoop but is also standalone, and in Spark there's something called MLlib, which is a machine learning library internal to Spark. There's also something called Hivemall, which is an add-on to Hive, and again can do data analysis type functions.

So, moving on to our demonstration in Hive, what we're going to assume is that we want to reduce a large data set into something that your desktop can handle. For this, we're going to use Hive. The Hive query language, which we'll be using, HQL, is based on SQL. This makes it easy to learn; well, easy if you have any SQL experience, and even if you haven't, it's a lot easier to learn than Java.

Now, where are we getting our Hadoop from? In this particular case, we don't really have a proper Hadoop cluster to work with. So what we're going to use is a virtual machine, a sandbox virtual machine. These are available from the likes of Cloudera or Hortonworks, which is the one we'll be using. You can also get VMs similar to the sandbox in the cloud, on Azure or AWS, that's Amazon Web Services. UKDS are currently developing a cloud Hadoop system and also an on-premises Hadoop system for secure data, but they're both still under development.

Okay, the environment we're going to use, like I said, is a Hortonworks sandbox. I'm using VMware Player on a PC, with six gigabytes of RAM allocated to the virtual machine. A guide for installing the Hortonworks sandbox will be on our website in a few weeks. The Hortonworks part of that guide will tell you that you need 10 gigabytes to run the VM; in fact, you can run it in six. So if you've got a PC with eight gigabytes of memory, you should be able to run it. Well, I won't say happily, but you'll get away with running it at least. Now, in the demonstration I'm going to show you the sandbox-provided web interface, and I'm also going to show you a third-party Windows-based Hive interface package. This package is currently in beta release, and it's free; I'm not sure if it'll remain free when it gets to its final release.

Okay, this is what we're going to do. We'll look at the dataset, do some selecting and filtering, generate a random sample of records, do an aggregation, and create a dataset containing a subset of the data. Next month's webinar, What is Hive?, will cover these in more detail, like loading data into tables, creating tables, et cetera, et cetera.

Okay. Now, this is where it gets tricky, because this is where we go into live demo mode. Excuse me for a minute while I check I've got all the right things available to me. Okay. When you first run your sandbox in VMware Player, or whatever, you end up with a screen like this. You don't have to touch this screen in reality; all you're interested in is this web address here, because that is the web address that you're going to put into your web browser, any modern web browser, and you'll get this screen up here. If you click on the advanced options, down here you'll see a reference to Hive, and this is what we're going to use. It gives you the address.
You can just click on that, and it gives you a user ID and password. Now, I've actually already logged into Hive, and this is the screen that you end up with. Within this interface you can access the various other components of Hadoop, some of which we're going to use, some of which we won't. What we're going to look at is the Hive UI, which is the Hive editor, where we can write queries; HCatalog, which is where your tables are stored; and the file browser, where your files are stored. We're just going to look at those, and then we're going to move on to the other system to do the demonstration.

So, if we start with the file browser: for the user hive, there's a little breadcrumb trail here, and you can see all the users as little folder icons. You click on a folder icon and it shows you the files in the folder. Just what anyone would expect of a file system. Hive makes use of tables, so one of the things that you often have to do is create a table based on your file. And again, the interface helps you out here: it gives you an option to create a new table from a file. If you click on that, you give it a table name and you give it an input file. If I say choose, I can pick one of those, or I can upload a file from my PC. So it's quite flexible; it'll do most of the things that you really need to be doing. We won't do it now, because we've already got all the files in place for what we're going to demonstrate.

In the query editor, you just write your queries in here in HQL: select star from questionnaire, semicolon, very important, at the end. Then you hit Execute and it will run that for you; you don't have to worry about anything down this side particularly. And when it has executed, the results will appear on the screen. Now, everything I've just shown you is quite a complete way of doing most of the things you want to do. But for this demo, I'm actually going to use this Dell Toad for Apache Hadoop, which has all the same functionality. You can see here all my tables, ways of importing files, and what have you. From our point of view, all I really want to show you is these little queries down here, which we're going to run and see what happens.

So, I've got my questionnaire file loaded and stored as a table. If we just do a select star from questionnaire, it will run that and it will show you all of the records in the file. So this is based on the example we did before: I've got postcodes beginning with M, house numbers, question numbers, and answers, randomly generated, okay? If I wanted to know how many records are in there, I could run this one, select count star from questionnaire. I won't run that; I'll just tell you there's about 640,000 records in the file.

The next thing we want to do is look at ways of cutting down this data. And the first way of cutting it down is, well, instead of asking for four columns, why don't we just ask for two columns? Which may seem a bit trivial when I've only got four to start with, but if you've got a data set which has tens or hundreds of columns, this is a very important way of reducing the data down to what you actually need. The other obvious way of doing it is to use the where statement down here, where in this case I'm going to take all of the columns, but I'm only interested in the rows where the postcode is M65AU. So if I run that and let it run, you can see it comes back with M65AU right across the board.
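For anyone following along afterwards, the reduction queries just described look roughly like this in HQL (the column names pc and qa are assumed from the questionnaire example):

    -- keep only the columns you need
    SELECT pc, qa FROM questionnaire;

    -- keep only the rows you need
    SELECT * FROM questionnaire WHERE pc = 'M65AU';

    -- and, for reference, counting the records
    SELECT COUNT(*) FROM questionnaire;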
So again, if you can filter out record types or record values that you don't need, then this is the way to do it. And obviously you can use the two in combination.

The next example is aggregation. There are a few things worth pointing out in this one. The first is that aggregation functions, like sum and count, and there's average, and there are some statistical ones like standard deviation, can be used in place of, or alongside, column names. The problem with doing that, of course, is that the result doesn't actually have a proper column name. So this as statement here, as total yes, is effectively creating an alias for what that column is, and this is effectively the new name of that column when I produce the result. So I'm just going to run that. Did I run it? No, I didn't. When that runs, this one will take a few seconds because there's more work involved for it. But when it comes back, when we see the answers, we're going to have three columns: postcode, total yes and total questions. And they're grouped by postcode, so in fact there's only one row per postcode, and I've got the total number of yeses in those questionnaires and the total number of questions in that postcode. Of course, the reason they're all multiples of 10 is that there are 10 questions in the questionnaire.

The next thing I want to show you is, well, what if I just want a random selection? I've got a very large file; I just want to take a random selection, download that, and see if that's any good on my PC. So here, this is where you do it. You effectively split your data set: you tell the system you want to split the data set into a number of buckets, and in this case I've chosen 64 buckets. You can have as many as you like; the more buckets you've got, the fewer records are going to end up in each bucket. So if I run this one, I'm dividing by 64, and you can see I get just these records back, 9, 14, 18, 19. If I were to run that again, because I've put a value in the rand function here, it means I will always get this consistent set of records back. So if you need to be able to reproduce what you've done, that is the way to do it.

Now, everything we've done so far just produced a result onto the screen down at the bottom here. But of course, in many cases you actually want to store that for further processing elsewhere. And the way that is done is to use a create table as statement. In fact, what I've got here is create table if not exists, and I've given it the name of the table, and as, and the query below it is exactly the same as we saw before. Now, the point about the if not exists is that if questionnaire postcode does exist, it's not going to do anything. And if I run that, it'll come back almost straight away saying, yes, that ran successfully; it just didn't actually do anything, because questionnaire postcode already exists down here. I can actually delete it from this interface, but you can also drop tables manually, so I've got a drop table questionnaire postcode there, and if I run that, it will actually drop that table. Drop equals delete. If I run the create again, it will recreate and repopulate it, with the same records in there.
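Again for reference, here are sketches of the three queries just demonstrated, with the table and column names (questionnaire, pc, qa, questionnaire_postcode, total_yes) assumed from the demo; the sampling syntax shown is one way Hive expresses bucketed sampling:

    -- aggregation with aliases, one row per postcode
    SELECT pc,
           SUM(CASE WHEN qa = 'yes' THEN 1 ELSE 0 END) AS total_yes,
           COUNT(*) AS total_questions
    FROM questionnaire
    GROUP BY pc;

    -- a reproducible random sample: split into 64 buckets and keep one,
    -- seeding rand() so the same records come back every run
    SELECT * FROM questionnaire TABLESAMPLE(BUCKET 1 OUT OF 64 ON rand(42));

    -- store a result as a new table instead of just displaying it, then drop it again
    CREATE TABLE IF NOT EXISTS questionnaire_postcode AS
    SELECT pc,
           SUM(CASE WHEN qa = 'yes' THEN 1 ELSE 0 END) AS total_yes,
           COUNT(*) AS total_questions
    FROM questionnaire
    GROUP BY pc;

    DROP TABLE questionnaire_postcode;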
Now, as a final example, I want to show you a mixture of what we've done so far. We're going to create a table called questionnaire answers; we're going to do an aggregation by postcode and question number; and we want the postcode, the question number, and this little, slightly complex calculation here, using the aggregate functions sum and count, which is going to give us our percentage of yes answers. So I've called that yes answers. And if I run that, what I get back is nothing visible, because it's just creating the table rather than returning anything to the screen. If I then ask it to show me some of the records from that questionnaire answers table which we've just created, it will come back and show me, for the first postcode, that's M65AA, questions one to 10 and the percentage of yes answers. And that query encapsulated the whole of our MapReduce example that we showed before. So this is a very clear indication of why you really don't have to worry about MapReduce, because there are always far, far easier ways of doing things.

Okay, that is the end of the demonstrations, and I think that is the end, so we just have a summary now. Let's remember: there may be no choice about using big data, especially if you don't control the source, which increasingly is the case. And Hadoop and big data tools are just that, tools; you're going to use them to get the data into the shape you want to deal with it in. You can use them to do all of your analysis, or to cut the data down to size for your preferred desktop application. And although Java and MapReduce are available, there's an increasing number of other tools available which will make the processing simpler. And that is the end.

Thank you, Peter. That was very interesting. I'd just like to add a few things about the work that the UK Data Service is doing around big data. We are scaling up to big data, so we are in the process of setting up a Hadoop system, aiming to be able to hold and process large data sets and to analyse open, safeguarded and secure data; so all types of data, with different access conditions. However, this is very much work in progress on our side. It is a very complex project and we will be updating users, so watch this space. You can follow us on Twitter and Facebook, and we've got a blog and a mailing list as well.

As Peter mentioned, we will be running other training events. There are two webinars coming up, one about Hive on the 22nd of March and one about Spark on the 19th of April, and you can see them on screen. We will also be providing some more guides about this, how to download a sandbox to your own computer, for example, and we will be having workshops. To get the full list of all of our events, you just have to go to the UK Data Service website, under News and events, and then Events. And if you do join our mailing list, then you will be emailed every time we add a new event.

I do want to say that this is the first time we're running this webinar on Hadoop, so we welcome any feedback at the end of the webinar. When you leave the webinar, there's going to be a survey popping up asking you just a couple of questions, and we'd really appreciate it if you took the time; if you have any comments or suggestions for improvement, please do send them to us. We'll now take questions. So thank you very much for attending this webinar. Bye everyone.