Hello everyone and welcome to this UK Data Service webinar on Hive. I'm Margrita and I'm an outreach officer working for the UK Data Service in Manchester. Presenting with me today is Peter Smyth. He's a research associate working for the UK Data Service too, and he's based in Manchester, at the University of Manchester. I'll hand over to Peter, who's going to talk more about Hive.

Thank you Margrita. Before we get into this I should point out that I have got a bit of a cough today, so if everything goes quiet it means I've remembered to mute the mic before coughing, and if it becomes explosive it means I've forgotten to mute the mic before coughing. So I apologise in advance. So, on with the webinar; hopefully everyone can see the screen, which I've just switched to full screen.

What we're going to talk about today: we'll give a brief definition of what big data is and the problems that you can encounter working with big data. Then we'll introduce you to Hive, and hopefully the majority of this webinar will go through some examples of using Hive to examine data and reduce the size of the data. And then, when we've reduced the size, we'll show these new, smaller datasets being used in desktop applications.

So what is big data? Well, for the purposes of this webinar we're going to use a standard definition from Wikipedia: big data is a broad term for datasets so large or complex that traditional data processing applications are inadequate. In practical terms, what that means is it won't fit into your desktop application, so if you're used to using Stata or SPSS or even R, it's just not going to fit. It's something which is, by definition, too big to fit into those applications.

So how has it become too big? Well, if you look at the traditional sources of data, like surveys and the census, this has been designed and collected specifically with research in mind. Certainly surveys, less so the census. But again, the census data is typically broken down into different areas of knowledge, where people will typically want to do research in that area, so you can treat it a bit like survey data. So the point is that this is data which has been collected almost purely for researchers.

But now we've got new data sources appearing. These data sources come from social media, so Facebook, Twitter and various other ones. We've got sensor data: lots of automobile-type stuff, GPS-type stuff, CCTV cameras, Fitbit and your heart-monitoring stuff, and virtually every application on your mobile phone probably uses data coming back from web servers and so on. And then we've got the transaction data. So we've got the traditional, I say traditional, the now traditional way of shopping using eBay and Amazon and things like that, where all of your clickstream, every time you click the mouse, is recording some data. It's not just recording the purchases you make, it's recording everything you look at and how you even move the mouse across the screen. We've got all your bank cards, where transactions at the till are being recorded, and of course you've got online banking itself.

All of these data sources produce vast, vast amounts of data, and what makes them a bit different from the traditional sources is that they haven't been collected primarily for researchers. The companies which are collecting the data, and providing the means to collect this data, are doing so for their own commercial purposes.
As a consequence of that, they're collecting what they need to collect, not what you might want to use, and the effect is that the data as it is collected can be very verbose. By verbose I mean there's an awful lot in there being collected which isn't relevant or of any use to you as a researcher. And the problem with these datasets, if you have them available to you, is that you've either got to take the whole dataset or leave it. So it'll be down to you to reduce this dataset in size and to extract the bits of the data from it that you actually want to use.

Now to give you an example of this, let's go through a little scenario where we imagine that you're interested in Twitter data. Oops, I think I've lost that there. We're interested in Twitter data, so we'll look at tweets. At the bottom of the view we've got something like a log scale, just a little guideline to give you an idea of how big and small these things are. So looking at your tweets: a sent tweet is very small, less than a kilobyte. But the actual data associated with the tweet can be several kilobytes in size, perhaps four or five. If you are interested in all the tweets from a user over a period of years, say, and Twitter's now just gone ten years old of course, you can have quite a lot of data involved. And if you're interested in analysing, say, the tweets from a user and all the friends of that user, the retweets and so on and so forth, then we could be talking about tens of gigabytes of data.

If we look at this from the other point of view, of say smart meter data, this is a commercial energy company, say, collecting usage of gas or electricity or whatever and storing it. Across many, many households and businesses all over the country you end up with a file which can be very large. But as this is typically collected in half hour intervals, which isn't really the level of detail you typically want to use, we could aggregate that by day and that will reduce the size. And then if we said, well, we don't need it by day, we need it by month, we can reduce it further. And if we say, well, I'm only interested in a certain geographical area, then again we can make it smaller.

But the point with all this is that if I was to draw an arbitrary line, let's say at about five gigabytes, and suggest that anything below that line your desktop application can cope with, then you will happily continue using your desktop application, even if it's a bit sluggish towards the larger end of the scale, and that's fine. But there gets to be a point where you end up having to use a big data environment, because the data is simply too big. So anything in this grey region here, if that's what you've got and that's what you're being forced to start off with, then you need a big data environment to deal with it. And the one we're going to look at, or use part of, is Hadoop.
So as I've just said, if your data set is big at the start then you need a big data processing system to work with it, and Hadoop is the one for us. Within Hadoop we need to do two things: we need to store our data sets and we need to process them. Now in theory, simply storing big data sets isn't a problem, because if you think about it you can go to Amazon now and get four terabyte disks to put in your NAS storage for all your videos and what have you, and that's definitely in the big data scheme of things. The problem, though, is that having got it onto your disk you can't do anything with it; you can't load it into your desktop applications to process it. So we also need a way of processing our data sets.

Storing a data set in Hadoop is almost transparent. Hadoop uses something called HDFS, the Hadoop Distributed File System. A Hadoop cluster itself can consist of hundreds or even thousands of computers all working together, and each one of those computers has its own disk file store, just like you have on your PC or on your laptop. So what HDFS does is chop up your large data set and give chunks of it to a whole load of these different processes. So imagine your nice data set all chopped up into bits and distributed to the four winds, almost. Well, fortunately, although HDFS is doing that, you don't need to care about it. As an end user a file system is a file system: you're just going to use your files and you're going to put them in directories, just like you do in Windows Explorer, which is good.

But for processing the data set in Hadoop, again there are several options available to you. There's the likes of Spark, Tez and Tajo, which I think is quite a new one, and then Pig and Hive, which are the more traditional ones, and today we're just going to look at Hive in particular.

So what is Hive? Again, if we use the actual definition from Apache.org, who actually distribute Hadoop and Hive and several of the other products as well that I listed before, the official definition is: the Apache Hive data warehouse software facilitates querying and managing large data sets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using an SQL-like language called HiveQL. Now what does that mean to us? All we need to know from that is that Hive will make our files look like tables, and we can use simple bits of code to extract data items from the tables, i.e. make the data files look smaller.
And that's pretty much what we're going to do in our demonstration today.

Now, a bit of background to the actual demonstration. The environment we're going to use is something called the Hortonworks Sandbox. Hortonworks is a commercial company specialising in big data, and the Sandbox is some software which will run on a PC or a laptop and look like a Hadoop cluster. It has all of the functionality of a Hadoop cluster, just not the performance of a Hadoop cluster, but all the things that we need to use are going to be in there. The second thing to note is that because we're looking at Hive today, not HDFS and other things, we're assuming that actually getting your data sets into Hadoop has already been done. Now, for both the environment and the loading of the data, on the UK Data Service website we will be putting up sets of instructions on how to get hold of the Sandbox and install it on your PC, and how to get hold of the data and the instructions needed to load it into HDFS so that it can be used in Hive, and hopefully they'll be there by the end of this week.

The actual data we're going to use is some smart meter data, and again that can be downloaded from the UK Data Service, with the study number in the brackets there. It actually consists of several data sets, but the ones we're going to be looking at are the geography and the gas readings data sets. The geography data that we're going to use, when you actually download it, is about one and a half megabytes in size. Now this is a small data set, there's no doubt about it; you can easily load it into Excel, and that's exactly what I did. I loaded it into Excel and I chopped out a few of the columns in that data set which I wasn't going to use. It fits into my desktop application, so I'm going to use my desktop application; it's as simple as that. So the actual file that we've uploaded into Hadoop is less than one megabyte, a tiny, tiny little file for Hadoop purposes. The other file, the meter readings for gas, is 6.8 gigabytes in size and has 246 million records in it. There is no way I can load that into Excel, so here it's got to go straight into Hadoop, and any kind of processing or manipulation I'm going to have to do within the Hadoop environment, in our case using Hive.

So, just to describe these files in a bit more detail. The geography file, this is our cut-down file, has columns if you like, or fields. One is called anon ID, and this is just a number which represents some household or other. It's also got a set of Acorn columns, which describe various demographic groupings, and I'm going to show you those in a minute, and the NUTS 1 and NUTS 4 columns are geographic-type regions; the NUTS 1 ones are quite large geographic areas of the UK.

So this is the Acorn structure: it's got 6 categories, 18 groups and 62 types, and you can see here the first ones: Affluent Achievers is the first category, Lavish Lifestyles is the first group, Exclusive Enclaves is the first type, and so on and so forth. You can download the full document from the internet: if you do a search on acorn demographics you'll probably find it. If you just do acorn you'll probably find a load of stuff about oak trees, so don't do that.

The NUTS 1 values are large areas of the UK broken down, so UKC is the actual code which appears in the file and represents the North East of England. That's not actually a complete list, because certainly in the file we've got some UKM values, which represent Scotland, and I imagine there's one for Wales as well, but there are none of those in the file.
So looking at the readings file, this is the gas readings file, 246 million records of it. Again we've got an anon ID; it's exactly the same as the one we had in the geography file, except that for some strange reason it's spelled differently in this file. That's nothing we've done, that's just the way the file comes. The second field, advance date time, is effectively a time stamp for when this particular half hour reading was taken, and we'll use that quite a lot to do the aggregations. The third column, HH, is the half hour period, so that ranges from nought to 47, the 48 half hour periods in the day. We're not actually going to use that at all, so in a sense that's an example of the verbosity of this file; it's just a tiny part of this particular file, but you can imagine that in another file there might be loads and loads of columns which you've got to accept but which you don't really need to use. The last one, of course, kilowatt hours, is the recording of how much gas has been used in that time period, and we'll obviously need that to work out how much gas has been used on a daily or a monthly basis and so on.

So, hopefully, on with the demonstration. These are the sort of steps we're going to take. We're going to create tables from the data sets, because that's what Hive does for us. We're going to explore the values in the geography table, and I'm going to draw some simple little bar charts just to show the diversity of those groupings and how the groupings work. On the gas readings file, remember these are stored in half hour periods, so we're going to aggregate them up to days and then to months, and then finally from the months we're going to take all of the data from the year 2009, just to make the graphing look more meaningful when we do that later on. We're also going to take the daily gas file and create a summary table, which will contain the minimum, the average and the maximum gas usage for each individual household across the entire date range of the file. And finally we're going to join the gas readings file and the geography file, because the gas readings file is only going to tell us how much gas a household has used; if we want to do any kind of meaningful analysis with that data we need to know something about the circumstances of that household, and that's coming from the geography file, so we need to join them together if we are going to subsequently go on to do any kind of analysis. We're not actually going to go that far today. What we're going to do is, in R, look at this joined file and just draw some simple graphs, and in Excel we're going to look at the summary data table, draw a scatterplot chart and investigate some outliers.

And how are we going to get from Hive into our desktop applications? We're going to use something called ODBC, Open Database Connectivity. It's another little piece of software; you can download it from Hortonworks, you can download versions from Microsoft, and it's all free. It's typically used to connect databases which don't have their own GUI, graphical user interface, to some other application which is going to provide those facilities for us. So we're going to download the tables from Hadoop, using ODBC, into R and into Excel.

For the actual demonstration, what we're going to do is, we'll obviously be using Hive, but we're going to use Hive via a product called Zeppelin. Now both Zeppelin and Hive are included in the Hortonworks Sandbox; they're already there for you when you set up the sandbox environment.
And obviously for the last two parts, as I've just said, we're going to go out to the desktop and use R and Excel, and connect to Hive via the ODBC connector.

OK, demo time. Before I even get there, let me just show you: when you actually load your sandbox up, in my case I'm using VMware Workstation, Oracle's VirtualBox is another option, but either way, when you finish loading up your VM you'll end up with a black and white screen looking very similar to this, and the only relevant part of it that you want is this address here. The reason you want that address is that it is the address you're going to put into your browser. I'm using Internet Explorer; Firefox and Chrome will both work. So when you put that address into the browser, this is the initial page that comes up. What we want to do is click on the advanced options there, and down at the bottom we have Zeppelin. Again, it gives you a separate IP address for Zeppelin, but it's also a link, so you can actually just click on it and it'll take you into Zeppelin.

And this is what Zeppelin looks like. The way Zeppelin works is that it allows you to create what are called notebooks. So if I go to Notebook here and say create a new note, it gives you some silly name to start off with, which you can change to whatever you want, you say create note, and nothing appears to happen. The reason nothing happens is because it doesn't actually load the note, so if you go back into Notebook now, you can see it down here, and if I click on it now I get this. Now this is an empty note, and what I'm going to do is run a very simple query in it. Because Zeppelin is a general purpose tool, you can run other types of environments in there, so in order to tell it that I'm going to run Hive I've got to type in hive first, and then I can type in a standard HiveQL-type query. In this example I'm saying select all of the columns from the table sample 07, which is a default one provided with the sandbox, it doesn't have any useful information, and the limit 12 says just return the first 12 rows. And then if I was to run this, with this little button down here, it will come back with the results of that query, the first 12 rows.

So the notebook is split into three parts: the top is where you write your query, the bottom is where the results are shown for you, and in the middle, after you've got some results, you get a little toolbar here. The first item on the left is just show a table, which is the default, as you can see with the table coming up here, but you can also change it to show simple little graphs, not necessarily always particularly useful graphs of anything, but you can change the settings to adjust that. So if I was to go back here, get rid of that and put salary in there, then you can get a little bar chart like that. So, very simple visualisations. I should point out that these visualisations are part of Zeppelin, they're not actually part of Hive as such, but we'll be using this for our little notebooks later on. So that's just a brief overview of how Zeppelin is going to work.

So now we want to go into our own Hive notebooks, which I've created previously, to do the various steps in our demo today.
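Just to give a rough idea of what a Hive paragraph in Zeppelin looks like on the page, here is a minimal sketch. The interpreter directive (%hive here) can vary between sandbox versions, and sample_07 with its salary column is simply the demo table mentioned above, so treat this as illustrative rather than the exact paragraph shown in the demo:

    %hive
    -- the %hive line tells Zeppelin to send this paragraph to the Hive interpreter
    -- sample_07 is the demo table shipped with the sandbox; limit 12 returns just the first 12 rows
    select * from sample_07 limit 12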
So the first thing we're going to do is make Hive think our data sets are tables, and the way we do that is to create an external table. We give it a name, geography; we list all of the various columns that are needed in that table, and these represent the columns which are actually in the file; we need to tell Hive that each of these columns is separated by commas; we need to tell it exactly where the file is located; and in this particular case we need to tell Hive that the file actually has its own header row, which it needs to ignore. So all we're really doing is giving Hive a representation of our table, so that when Hive needs to process the table it knows how to actually find everything and treat your data set as a table, which is what we want to do. So that's the geography table.

It's exactly the same process for the gas table: here the location of the gas data set is /user/hive/energy, it's only got the four variables we discussed before, they're all separated by commas, and again we ignore the header row. Now, when you actually run this query, and I won't do it because I've already created these tables, you can see at the bottom here it took one second, took one second. I've already told you that there are 246 million rows in this gas table, so to create this in a second or a few seconds seems pretty impressive, but the reality is that all Hive is actually doing is making, if you like, a mental note of how this data set has to be processed when you do proper work against it. It's not reading in 246 million rows of data; it's just associating this description with the files in that folder.

OK, so moving on. We've created our tables; the next thing we wanted to do was look at the geography file, and so up here what we're going to look at are the Acorn classifications. What we want to do is, for each value occurring under the Acorn category column, we want to know how often it occurs and what the value is. So if we run this, what we end up with is a little table here. If you remember, there are supposed to be 6 categories, 1 to 6, and that is certainly where the bulk of our data is, but for some reason we've got 9 which have a category of 0, which doesn't make sense, and we've got 1 other odd value, which again doesn't make sense. So you can use this simple technique to decide whether you've got null values, missing values and so on, or perhaps nonsensical values, things that you weren't expecting. And we can do exactly similar queries with all of the other variables in the geography table.

So the next one down, I'm doing exactly the same thing, but here I'm using the Acorn group; the rest of the query is exactly the same. What I'm showing down here, though, instead of showing the table, which goes from A to Q: if you graph that as a little bar chart then you can get a feel for how well represented each of these various groups is. So clearly A, B and C are well represented, H and I are, but there are vast differences between, say, A and D, which is something that you may need to be aware of when you're considering what analysis to do. And again, this is just exactly the same thing on the Acorn type, where you've got far more groups to worry about. And then we do exactly the same thing on the NUTS 4 column. This is quite small geography; I don't know exactly what the areas represent in terms of size, but there are quite a few of them, and again you can see how they're spread. And if we do that for the NUTS 1, where there's only that handful of descriptions, UKD, UKF and what have you, again you can see there's a vast difference between how many values of UKF we've got in the file as opposed to UKD, and at the end here we've got UKM, which was missing from our description table, and which I know actually represents the whole of Scotland.
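As a rough sketch of the sort of statements being described here, they might look something like the following. The column names (anonID, acorn_category and so on), the types and the HDFS location are assumptions based on the description above rather than the precise statements used in the demo, and each statement would go in its own Zeppelin paragraph:

    -- make the comma-separated geography file look like a table; the data stays where it is in HDFS
    CREATE EXTERNAL TABLE geography (
      anonID          INT,
      acorn_category  INT,
      acorn_group     STRING,
      acorn_type      INT,
      nuts4           STRING,
      nuts1           STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/hive/geography'
    TBLPROPERTIES ("skip.header.line.count" = "1");

    -- explore one column: how often does each Acorn category value occur?
    SELECT acorn_category, COUNT(*) AS how_many
    FROM geography
    GROUP BY acorn_category;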
So that's the geography file looked at; now we're going to start doing our aggregations of the gas readings file. Just before we do that, just to confirm what I was saying before, if we run this query here, select count star from all gas, it will tell us how many records are in the file: 246,482,700. As we go down this little notebook that we have, that's just a listing of the first five records in that file, so you can see here, on the advance date time, the format of this: it starts with a two-digit day, a three-letter month and a two-digit year, then a colon and then a time, in half hour periods; obviously these first few are all at 0 o'clock.

Moving down, the first thing we want to do is aggregate this file into days, and this is the code we're going to use to do it. We're going to call the table gas days. We're going to use what's called a substring of the advance date time field. Now substring is a function, the same function as you'd find in Excel, so I'm taking the advance date time field and saying start at position one and take the first seven characters, and if you remember what we've just seen, that's the date; I'm just going to chop off the time at the end. I want to sum the gas, because each half hour represents the gas used in that period, so to get a daily reading I need to sum them all together. At the same time I'm going to keep track of how many readings, how many rows in the file, are actually being associated with this aggregation, and the aggregation is going to be done on anon ID, so on the household, and on the advance date time field. So what I'm going to end up with is a table which, for each household, on each day for which there are records, will show me the sum of the gas used and how many readings were associated with that day. Simply running that: if I look at how many records I've got in my gas days table, I've got just over five million, so I've already reduced it quite considerably, and I can look at the first few records in gas days. So this is for household number one, these are various days for household number one, and they aren't necessarily in any order; there may be lots of other dates for household number one in the file later on. We've got the total gas used and the number of readings; 48 is what we would expect to see, there being 48 half hours in a day.

But if I now run another query, which asks how often each of these counts of readings occurs in the file, and this is a very similar one to what we were doing for the geography file, and graph that, we can see at the end here that we've got just over four million with 48, we've got nearly a million with 47, and almost everything in between has some little value in there. So anything that's not 48 represents a day for which there are readings missing. So what we're going to do is drop that table, that is, delete that table, and start again. The way you delete a table in Hive is just to say drop table and then give it the table name, and Hive deletes it for you. What am I really deleting? Well, in this case I'm deleting the whole of the gas days table. If I had done that for all gas, all I'd be deleting would be the description; I wouldn't be deleting the original data for all gas, the readings data. OK, so then all I'm going to do is recreate the table with this query here, and this is exactly the same query as we had before except that I've added this line here, which means I only want you to put into the resulting table records where the count of readings equals 48, i.e. where I've got a full day's worth of readings in there.
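To make that concrete, a sketch of the kind of query being described might look like the following. The table and column names (all_gas, gas_days, anon_id, advancedatetime, kwh) and the exact date layout are assumptions based on the description above, not the exact code shown on screen:

    -- a first attempt can be removed and recreated with: DROP TABLE gas_days;
    -- aggregate the half hourly readings up to days, keeping a count of how many
    -- readings went into each day
    CREATE TABLE gas_days AS
    SELECT anon_id,
           substr(advancedatetime, 1, 7) AS reading_date,  -- first seven characters = the date part
           SUM(kwh)                      AS gas_used,
           COUNT(*)                      AS readings
    FROM   all_gas
    GROUP  BY anon_id, substr(advancedatetime, 1, 7)
    HAVING COUNT(*) = 48;   -- the line added second time round, keeping only complete days
    -- the monthly roll-up described next uses the same pattern with a different substring,
    -- for example substr(advancedatetime, 3, 5) to pick out just the month and year part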
So now I've got the 4,173,000 records, and there are the first 5, which is a bit redundant because they're exactly the same 5 as we had before. The next thing I'm going to do is aggregate this into months. The format of that query is exactly the same, except I'm using a different substring, because now, instead of picking up the whole date, I'm just picking up the month part of the date. And again I ran a query to see how many days in a month we've got, all the way down here, and so there are lots of 31s and 30s, as you'd expect, a fair number of 29s, because the range includes a leap year, and 28s, but again you can see there are lots of records which don't seem to have complete months. So you may want to get rid of those, depending on what you're doing; in this case we're just going to leave them in, just for show. But we're now down to 148,893 records.

The next query I'm going to show you is just creating the gas months 09 table, and this is going to take what we had before, the gas months, but I'm only interested in the year 09, so again it's just another variation in the substring to select the 09. I'm now down to 70,106 records, and that's what the first few look like, so you can see, for example, that the first record only has 26 readings, so it's not a full month's worth there.

OK, and the other query we need: we need to create our summary table, which we're going to use later on. You can see here we're creating a new table; we want the minimum, the average and the max, and these are all built-in functions, just like you'd find in Excel; we're giving them new names so that they look nice in the table; and we're just sorting and grouping them, again by the anon ID. So for each anon ID, each household, there will be one record with the minimum, the average and the max of the gas usage on a daily basis across the entire time frame of the dataset, and you can see what some of them look like down there.

The next thing we want to do is our join of the tables, and this is the query which does the join. I'm just going to switch back now to describe that join process in a bit more detail, so we'll go back to the slides for a minute. These are just a couple of slides explaining Zeppelin, but we've described that in the demo, so we don't need to go through them. So, joining the tables: we've got two tables that we're interested in, the geography table, which is the original geography table, and the one that we've just created, gas months 09. Now if you've got to join tables, they've got to have one column which is common to both. In our case we've got anonID on the geography table and anon_ID on the gas months 09 table; now despite the names, we know they're the same thing, so that doesn't make any difference to us. In the geography table the anon ID occurs once and once only for each household, because it's just a single record describing the geography and demographics of that particular household. On the gas months 09 table, however, we're typically going to have 12 rows of data for each anon ID, because there are 12 months in the year 2009 and there should be one row for each one. It's not guaranteed, because you don't know what data is missing, but that's what you'd expect. And so again, this is the full listing of that query: we're creating a table which we're going to call gas 09 geography, and this section here is a list of all of the fields that we want from both tables.
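A sketch of what such a join might look like is below. Again, the table name gas09_geography, the alias letters and the exact column list are assumptions based on the description, with a and b being the aliases discussed next:

    -- join the monthly 2009 readings to the household geography and demographics
    CREATE TABLE gas09_geography AS
    SELECT a.anon_id,
           a.reading_month,
           a.gas_used,
           a.readings,
           b.acorn_category,
           b.acorn_group,
           b.acorn_type,
           b.nuts4,
           b.nuts1
    FROM  gas_months_09 a
    JOIN  geography     b
      ON  b.anonID = a.anon_id;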
The ones prefixed with a are coming from the gas months 09 table, and those prefixed with b come from the geography table. Using these what are called aliases, or new names, for geography and gas months 09 is just a convenient shorthand for prefixing the column names, so you know which table they've come from. Also, in a lot of cases, if you're joining tables you're joining them on a field where the name is actually the same in both, so you do need some way of distinguishing them: on the last line here, where we've got on b.anonID equals a.anon_ID, if the anon ID had been written the same in both cases we would need the a and the b to distinguish which one we were talking about. And when we've run that join, what we end up with is a single table, the gas 09 geography table, and in there, on the left hand side, we've got the columns which came from the gas months 09 table, and on the other side we've got the columns which came from the geography table.

So we're now ready to look at the last part of the demonstration, which involves using R and Excel, and we'll start with R. What I've got here in RStudio is just a little R script. I've run the first part because it's just setting things up. The first line that we're really interested in is this line here, where I'm going to use my ODBC connector to connect to Hive, and when I run that I just get a little variable up here called channel; you can't really look at that, it's just telling me that it's got a connection. This is really the important one, because here what I'm going to do is run a query across my ODBC connector to select all of the rows from the gas 09 geography table, and return them to a data frame called DT1. If I run that, after a few seconds I can see I've got DT1 up here, and I've got 70,106 rows in it, which is what we saw before on the gas months 09 table. Just a little bit of housekeeping down here: I'm going to copy the data frame before I play with it; I'm going to do a bit of manipulation to include in this new data frame a properly formatted date variable, because I need that to do the graphing; and then I'm also going to do an aggregation so that I can draw graphs based on the NUTS 1 value. So you can see here my DT copy has the same number of observations as DT1, except I've got this extra variable at the end, which is called DATEN, and that's the properly formatted date.

So now I'm going to do my graphs. The first graph I'm going to show is a bar graph of NUTS 1, and here you can see it doesn't really tell you much, because we haven't taken into account the number of records, if you like, relating to each of these NUTS 1 categories, but what it does clearly show, and what we already knew, is that people use less gas in summer than they do in winter. OK, and the second graph is a bit similar, but this time we're going to use the Acorn categories, of which there are only 6, and again you get a nice little stacked chart, really just telling exactly the same thing: you use less gas in summer than you do in winter. So that's it for R.

So, moving on to Excel. What I've got here is, I've just opened an Excel spreadsheet; there's nothing in it at the minute. I'm going to go to, oops, to Data and say From Other Sources, Microsoft Query, and here I'm going to pick my Sample Hortonworks Hive DSN, which is my ODBC connector. If I click OK there, it asks me for a password; I put it in, I can test it, and it tests OK. As soon as I say OK it will go to Hive and get a list of tables for me, and the table we want to use is this gas day summary table that we created especially for this purpose.
I'm going to select that, I want all of the columns, and I'm OK with everything else, so I'll just default everything through to Finish. Where do you want to put the data? Not there; start in the corner up there. And it will now go to Hive, get the data from that table, download it and put it in here. Now what I want to do from there is draw a graph, in this case a scatter chart; this particular one is actually only available in Excel 2016, I believe, but never mind. If I make it a bit bigger, oops, so you can see it, you will immediately see at the top here that I've clearly got a rogue outlier, the point at 11840, 22,120, and if I was to go back to my table and sort it, you can see that coming up to the top. And there are a few of these here which might be a bit suspicious as well.

And just finally, what I did, and I did this earlier, was use exactly the same data but cut out all of the maximum values above a thousand, and if I draw that graph again, which draws itself, you can see things are far more spread out than they were before at the bottom. But what I found interesting was this line here, just below 800: there's very clearly almost a straight line drawn across there, an awful lot, an unusual amount if you like, of entries which all appear to have that same value. And if you go down and look at the table, you can see that for some reason there is quite a range of records, these being the daily maximum values for the households on the left, and not only are there so many which have that same maximum value, there are quite a few which seem to have the same minimum and average values as well. So I think that would be worthy of further investigation before you delve into any serious analysis. That is the end of the demonstration, and that is in fact the end of the webinar.

So, in summary, what have we got? We may have no choice in using big data, especially if we don't control the source: if you're getting it from somewhere else and they've collected it for their purposes, you've got to use it as it is, and you've got to be prepared to slice and dice, as we say, the data sets so as to reduce them in size. Always remember that Hadoop and the big data tools that you find inside Hadoop, like Hive, are just tools for you to use to manipulate the data as you need to, because in a lot of cases all we really want to do is cut it down to size so you can take it out of the Hadoop system and back onto your desktop, where you're probably going to be a lot more comfortable doing the analysis and so on and so forth. We've demonstrated the desktop using R and Excel, but of course SPSS or Stata or any of your other desktop applications will work just as well.

If you want more information on Hive, or even Zeppelin, there are websites which you can go to: if you want details of the syntax of Hive, you can get information on that, and if you want to look at a basic tutorial on Hive, that's also available on those websites. If you want to read a book, there is an O'Reilly book called Programming Hive, which is a very readable book, but it actually contains an awful lot more information, more in-depth, complex information, than you'd probably want to be using at the moment; but there are sections on creating tables and things like that which you might find quite useful. Any questions?

So thank you again for attending, thank you Peter for presenting, it was very interesting, and have a good afternoon everyone. Bye bye.