Hello everyone and welcome to this UK Data Service webinar, Introducing Databases. I'm Marguerita, a Senior Communications and Impact Officer, and presenting today is Peter Smyth, a Research Associate working for the UK Data Service at the University of Manchester. Thank you Marguerita, and thank you everyone for coming. Today's webinar is about databases. We're going to consider the definition of a database, and then look at why Excel, which people quite often do use as a database, may not be good enough for what we want to do. Then we'll talk about relational databases, with a bit of history and some examples of using them. After that we'll look at document databases, by which time you'll understand what a document database is, followed by a quick demo, and we'll finish with a demo of a graph database. So, what is a database? Here is a definition taken from the Oxford University Press: a structured set of data held in a computer, especially one that is accessible in various ways. The key points here are, first, structured. What do we mean by structured? Is the data ordered in some special way, or arranged in a special way? The definition doesn't actually tell you how it is structured, so it's open to interpretation, let's say, and certainly different database organisations will have different views on what structured means, as we'll see later on. The second point is accessible. Well, that's pretty clear: it has to be searchable in some way. If you're putting data into a database, you need some mechanism for getting the data out again, and the whole point of a database is that you put the data in in such a way that you have easy ways of searching it, to get it out in certain orders or to group different data items together.
We're going to use a series of different queries in order to do that, and the different database systems that we look at will have different ways of querying the data. So a database has to be structured and it has to be accessible. Pretty clearly, then, this is not a database: it's unstructured, everything is just thrown in, and you'd have great difficulty searching it; you'd have to tip it upside down and unscramble all the papers, and so on. Not a database. On the other hand, some people do use Excel as a simple database. Why? Because worksheets in Excel are tabular in nature, which gives you a rigid structure, and that ticks one of the boxes. You can effectively join different sheets, different tables, together using the VLOOKUP function. In fact, database-type functions have been built into Excel for many, many years now, which allow you to sum things, count things and so on; these are all things which people would do with a more conventional database, and they're already built into Excel. You can also write queries to filter the rows, and I'm not just talking about the little drop-down filters; you can have genuinely complex queries, although they are quite tricky to set up. And this leads on to the disadvantages of Excel. You're limited to about a million rows. That could be quite a large database in some people's eyes, but in today's big data environment it is increasingly a limitation that people will quite easily surpass in their own collections of data. Then there are restrictions on the functions: VLOOKUP can only return a single column, which is typically not what you want when you're trying to join tables together; you potentially want multiple columns from multiple tables. And the database functions can only return a single value.
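To make that VLOOKUP limitation concrete, here is a small pure-Python sketch. The price-list data and the `vlookup` helper are invented for illustration; they are not part of Excel or of the webinar demos, just a way of contrasting a one-value lookup with a join that carries whole rows.

```python
# Invented price list: product -> (unit_price, category), standing in
# for an Excel lookup sheet.
prices = {
    "apples": (0.40, "fruit"),
    "cheese": (2.10, "dairy"),
}
orders = [("alice", "apples", 3), ("bob", "cheese", 1)]

# VLOOKUP-style: one call fetches a single value from a single "column"
def vlookup(key, table, col):
    return table[key][col]

print(vlookup("apples", prices, 0))  # 0.4

# Join-style: one pass attaches every price-list column to every order
# row, which is what a database join would give you directly
joined = [(who, item, qty, *prices[item]) for who, item, qty in orders]
print(joined[0])  # ('alice', 'apples', 3, 0.4, 'fruit')
```

To pull several columns back in Excel you would need one VLOOKUP per column; the join-style version gets them all in one operation.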
Again, that's not typically what you might want for a proper analysis of the data, or it allows only a rather limited analysis. And certainly setting up the queries can be quite complex and not particularly intuitive. So why use a desktop database? I'm emphasising the word desktop here because we probably all know that there are very large databases on large servers, but of course that's outside most people's normal reach. The reasons for using a desktop database are, first, the size of the data: is it too big to fit into the likes of Excel, Stata or SPSS, the sorts of tools you might commonly associate with doing analysis? Second, the convenience of a desktop system: you're already using your desktop or laptop for your ordinary everyday work, so having all of your data in there, as opposed to on some external server, obviously has benefits, convenience if nothing else. Third, there's flexibility in collecting and persisting the data. What I mean here is that you're not necessarily going to come across all of your data in one file, in one fell swoop. Over the period of your research you may be adding to your data in a regular, or even irregular, manner, but you want some way of keeping it all together, making sure you don't lose the bits you've got, and being able to gradually add to the data to build up a larger corpus. And then we've got the key selling point of databases: the flexibility they give you in querying and analysis. What I'm going to show you now is a couple of slides which I used in an earlier webinar, on Hadoop, where we talked about the growing and shrinking of data. I'm just going to whizz through the little bits in here and then describe it at the end.
Basically, what we have at the top is tweets, and we're going to grow the amount of data involved in those tweets right up to the point where the volume of data you've collected is so large that you can't conveniently process it in your desktop applications, so again we're talking about the likes of SPSS and Stata. On the bottom half we're assuming you had a large set of data and you may have decreased it. The point is that if we're growing data, it gets to a stage where you can't use your normal desktop application, and at the time we suggested you use a big data system like Hadoop. What we've got today is a slightly modified version of that slide, which works in much the same way as we build the data up. It's really the building up of data that we're interested in in this session, because, as I say, if it's already too big then it's too big. Here, instead of having a single line at 5 gigabytes, we've got a line at 5 gigabytes and a line at, say, 25 gigabytes. Below the first line we're saying desktop applications, as before, and above the second we've got a big data environment, as before, but this middle section is potentially amenable to a desktop database: data which is too big for your normal desktop applications, but for which it would be inconvenient to have to go to a big data environment. In this grey area we have the potential for using a desktop database, and that's what we're going to consider for the rest of this webinar. So, there are four main types of database, and we're not going to look at all of them. We have relational databases, which are by far the majority; document databases; graph databases; and what are called wide column stores, which we won't really be considering.
We'll look at relational databases, document databases and a graph database. Just a brief description of what these entail. The relational ones, as I say, predominate by a long way: data is held in tables, and relationships are defined between the tables to enable you to join them. The document databases, and the wide column stores as well, use storage architectures which are designed to overcome the limitations and scalability problems of relational databases, and as big data sources have become available, i.e. as the volumes of data have increased, these have been gaining popularity. Then there are the graph databases, which are designed to optimise specific types of queries. These are for when you're more interested in the relationships between items than in the actual attributes of the items. The obvious example I always think of is social networks like Twitter and Facebook, where you have followers, and followers have followers, and you want to know who follows who follows who, and so on. There are a lot of research-type questions you can pose which fall into this category, and by far the most efficient database for answering those kinds of questions would be a graph database rather than a simple relational database. Just to give you some background, there is a site, the DB-Engines ranking, which ranks all of the various databases by how popular they are, updated on a monthly basis. I'm not going to go into how the score is calculated, so treat these as relative positions if you like, but you can see from this table that the top three are all relational databases.
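The who-follows-who question described above can be sketched in a few lines of Python. The follower network below is invented, and a real graph database would answer this with a declarative query rather than hand-written search code; this is only a sketch of the kind of traversal such a query performs.

```python
from collections import deque

# Invented follower network: user -> list of users they follow
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  ["frank"],
    "erin":  [],
    "frank": [],
}

def shortest_path(start, goal):
    # Breadth-first search: the first path that reaches the goal is
    # guaranteed to be a shortest one.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in follows.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of follows links the two users

print(shortest_path("alice", "frank"))  # ['alice', 'bob', 'dave', 'frank']
```

In a relational database this traversal would need one self-join per hop; a graph database stores the links directly, which is why it answers these questions so much faster.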
MongoDB, which we'll be looking at, is the highest-ranked non-relational one, and it's a document database. Then we've got Cassandra, which is a wide column store, and not very much further down we have Neo4j, which I think is the top-rated graph database. You can go through this table and look at all the different types of database you can get. Among the non-relational ones there are a few, like Solr, which are really search engines, and some are in fact more specifically tasked databases, as opposed to the ones at the top here, and the ones we're looking at, which are for far more general use. So, just to summarise that, I've got here a table of the various databases, and I've listed these because they are all freely available. You can load any of these onto your desktop if you like. I'm not sure Cassandra would work too well on a desktop, but certainly for the examples I'm going to show you, MySQL, Neo4j and MongoDB are all loaded on my desktop, and that's where I'm going to run the queries from. If you go to these sites you can find instructions for downloading and installing them, and tutorials on how to use them if you need more information. So let's go on to the relational model, and we'll consider why we have it, what it's good for, its pros and cons, and what we mean by relational. I'll try to answer those questions, not necessarily in that order, over the next few slides. So let's have a look at a bit of the history.
The term relational was first used by E. F. Codd in his 1970 paper 'A Relational Model of Data for Large Shared Data Banks'. Although it wasn't necessarily the main reason for developing the relational database, it should be remembered that at the time computer storage, that is disk storage, where you store your data, was very expensive, and one of the advantages of the relational model is that it could be very efficient in storing data, because it stored each item of data only once, as opposed to repeating the same data across different records. Just to illustrate how expensive storage was: in this graph, if you go back to 1980, the cost of storage was $193,000 per gigabyte (this is a log scale on the y-axis), and by 2014 it had dropped to three cents a gigabyte. That gives you an idea of how dramatically the cost has come down, and you can imagine that the priority given to conserving and being efficient with data storage has also waned in that time. Back in 1980, every byte you could save was money in the bank, whereas at three cents a gigabyte, frankly, you can store anything you like and just keep it there; it costs you virtually nothing. That not only affects the relational model and why it was developed, but it also goes some way to explaining, when we come to the document stores and the MongoDBs, why we don't care that there seems to be a lot of duplication in the information. So let's look at how the relational model works. Imagine I've got some information on a household. I've got various items about the household: an identifier, the address, the postcode; we've got the people who live in this household; we've got something about the house itself, the number of rooms; and we've got the type of house construction, you know, brick, semi-detached, that sort of thing.
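A household record like that can be sketched as relational tables using Python's built-in sqlite3 module. All the table names, columns and values here are invented for illustration: the house details are stored once, each person row carries only the household ID that links to them, and a join reconstructs the flat view on demand.

```python
import sqlite3

# In-memory database, invented schema: household details stored once,
# people linked to them by household_id.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE household (
                   household_id INTEGER PRIMARY KEY,
                   address      TEXT NOT NULL,
                   postcode     TEXT NOT NULL,
                   rooms        INTEGER,
                   construction TEXT)""")
cur.execute("""CREATE TABLE person (
                   person_id    INTEGER PRIMARY KEY,
                   name         TEXT NOT NULL,
                   household_id INTEGER REFERENCES household(household_id))""")

# One house, two occupants: the address and postcode are not repeated
cur.execute("INSERT INTO household VALUES (1, '1 High Street', 'M1 1AA', 4, 'brick semi-detached')")
cur.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [(1, 'Alice', 1), (2, 'Bob', 1)])

# The join rebuilds the flat one-row-per-person view when you need it
rows = cur.execute("""SELECT p.name, h.address, h.postcode
                      FROM person p
                      JOIN household h ON p.household_id = h.household_id""").fetchall()
print(rows)
```

Note the NOT NULL constraints: this is schema on write, so a row with a missing address would be rejected at load time rather than discovered during analysis.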
Now, if you wanted to put that into a table, you would end up with something like this, and you can see it's all of the same information laid out in a table as you commonly see it. The point, though, is this section here and this section at the end, where we're talking about the house and not the people who live in it: it's effectively repeated for each member of the household, which means it's just wasted space, repeated information, and back in 1980 that would very much be a no-no; you couldn't afford to do that sort of thing. So what the relational model does is break it down into a set of tables, where each table is linked to another table by some common key, so the household ID here is the same as the household ID here, and that's what they're joined on. You can use these joins to reconstruct the full table if you want to, but there are two advantages. One is that when you split it up into tables, the information about the house itself only has to be stored once, because all five persons refer to the same set of information about the house. The other advantage is that if you were just interested in, say, names and addresses, or even information about the types of house construction, then you don't have to touch the other tables, so you restrict how you access the data; you access less data, and therefore it's more efficient. So let's go through that in more detail. Data is only stored once, and it's certainly efficient for well-known and structured data, that is, data where you know the column names, if you like, in advance, and you know that when you get a new observation, all of those fields, or potentially all of those fields, are going to be
there, and you can just put another row in your table to accommodate it. Querying tabular, structured data like this is very well understood, and we have the Structured Query Language, SQL, which is almost universal for querying structured tables like this. It's relatively easy to learn, and of course you can use it with any relational database. The other advantage is that they all operate on a system called schema on write, and what this allows for is data checking as you load the data, because schema on write means you've got to have defined your table definition in advance of putting any data into it. When you do get around to putting the data in, you can do lots of checks: if I said this column has got to hold an integer, and I try to give it a house address, then I know something's wrong. There are various other things: you can say this column can't be blank, so if someone tries to put an empty value in there, then I know there's something wrong with that observation at the time, as opposed to finding out later on when you come to analyse the data. So that can be an advantage, and it certainly makes for cleaner data once you've loaded it. There are, of course, disadvantages. Multiple tables increase the loading time, because if you take a single observation, you may have to split the data in it across several tables, or four tables as we had in the example before, and that obviously takes a bit more time; perhaps that's not a concern, perhaps it is. Vertical scaling is a problem. This isn't really relevant to our desktop situation, because your desktop is your desktop, but in larger server environments the problem for relational databases is that if you want to make them bigger and faster, you've got to house them in bigger and faster server boxes, and there is a limit to how well you can
do that. The other systems that we'll look at can scale horizontally, i.e. you just add new servers, which makes them a lot more scalable in the long term, but like I said, for our discussion today that's not really relevant because we're sticking to our one desktop anyway. Then there's schema on write again: you can't deal with unstructured data, because you've got to pre-define what your table is and all the columns in it. You can't have data which comes along with a new column in it, or with the columns in the wrong order, and that can make it very difficult to deal with unstructured data sources, which is increasingly what we tend to find in the modern world, with all of the various data sources coming from the web, web APIs and so on. So it does have a bit of a restriction there. Now I'm going to show you a little demo of a relational database. Let me start the system. Like I said, you can get this from the internet, you can download it, and it's all free; it comes with a nice user interface. I log in, and you can see that I've got a few little tables down here. I just want to emphasise a couple of things with these very quick examples. First of all, I'm going to create a table called city2, and I've got five columns in it: ID, name, country code, district and population. You can see that when I'm defining these columns, in addition to the column names I'm saying what kind of information they hold, the type of the information. Here I'm saying it's an integer with up to 11 digits; here I've got a character field of up to 35 characters. I'm also saying NOT NULL, i.e. you cannot load a record into this table if the country code is missing. The other thing I can do is give a default value, so I can say it's NOT NULL, but should you actually get a null, you can set
it to zero instead, or to a space, or whatever. But again, you have to know that's been done, because you're effectively altering your data if you start doing that as you put it in, so you have to think it through before you start setting these tables up, and know what your data is going to look like. So let me create this table, city2, and you can see city2 appear; you can expand it and see all the columns I've just created. There's obviously no data in it at the moment, which I can verify by running this simple SELECT statement, select everything from city2, and there's nothing in there, which is what we expect, as I've just created it. Now I'll insert some data into city2 with this statement here; I'm just going to insert these three very simple rows at the bottom, and we'll be able to look at them again. In practice this isn't a very practical way of loading data; what you would normally do is bulk load the data using a statement like this, LOAD DATA, from a file into the table. But the point is you can only do that if you have already defined the table; the table must be defined first in these relational databases. So, having inserted my data, I can run the queries again, and if I run that one again, I've now got my three rows in there. I've got a couple of other queries here based on my other tables, city, country and countrylanguage, and here I can actually do an aggregation of the data: I'm looking for the head of state from the country table, and I want to know how many countries each head of state is head of state for, and if I run that, we find that Elizabeth II is the head of state in 35 different countries. In this one here I'm joining two tables together, the city table and the country table, and the fields at the top are the two I want from city and the two I want from country;
if I run that, I get the results like so. Okay, so writing queries on a relational database can be quite straightforward and simple. Now, back to the slides. Moving on to the document database, we have the same set of questions as before: what's it good for, pros, cons, and what is meant by document? That's a bit different. The first thing is that a document here is not a PDF or a Word document; it's something which contains semi-structured data. What does semi-structured mean? Well, it's structured in that every data item in the document, and a document here is what you would think of as a record or a single observation, has a name associated with it. It's semi in that different documents in the same collection of documents, say you've got observations from a survey, so the survey is the collection and each individual response is one document, don't all necessarily have to have the same set of answers in them, or the same set of questions answered. Semi-structured data is almost invariably provided in JSON format. This is a data format, certainly the most popular for semi-structured data, and essentially anything that you download from a web-based API will be in JSON format, or you'll be able to get it in JSON format. So the idea is that we're looking for some mechanism to efficiently process this kind of data. Just to show you what it looks like, here's an example of some data formatted in JSON. Everything between the curly brackets is a single record, or observation, or document, and it's got three items in it: a name, which has a value of Manchester, a postcode, an established date, and so on. It's always in the format of some kind of key name and the value for that key, and it's probably quite clear that when you get around to searching or querying this type of information, if you say
name equals something, it's going to come back and say Manchester, although you don't actually write it quite like that. Now, the semi-structured nature means that it can be difficult to store in tables, because not all of the fields need to be in each document. In a sense that wouldn't be a great problem if you actually knew what all of the possible fields could be, but the reality is you don't know that, and even if you did, it could make for very sparsely populated tables. And of course the fields don't have to be in the same order. Here's a little example of two documents. The first one has three data items in it: an ID, a name and a telephone number. The second one, which is in the same collection, has the name and ID in a different order, and it also has an email, which happens to have two values in it, and a mobile number, which the other one didn't have. But they're both perfectly valid to be stored as part of the same collection, and you can see from that that it's almost impossible to create a schema for this in advance. Because of this, when you're accessing this kind of data, it's referred to as schema on read; the schema is just the definition of the table in the relational sense. It's schema on read because you can't do schema on write as we did before, as that requires you to have written the table definition before you load the data, and we can't do that here. So we have schema on read, where you just store all of the data as it comes in, as the records come in, and then when you need to search the data, you start saying, well, I need something which has an ID in it, and I need something which has a mobile number, and when you do the search you only get the records which have those fields in them. Okay, so we're going to talk specifically about MongoDB, or show the examples in MongoDB, and it doesn't use SQL to query the data. When you install MongoDB,
you get a simple shell which you can use to query the data. The bad news is the shell really requires you to know JavaScript to make it work effectively, and it's a little black and white screen; it's not very pleasant to use. As an alternative, if you're used to programming environments like Python or R, they both have packages which allow you to interface to MongoDB directly, so instead of writing your queries in this JavaScript-type shell, you can write them in Python, using Python-like constructs, and that is what we're going to have a look at in a minute. Because it's not SQL, this type of querying can make accessing the data in these kinds of databases a bit more complex; it's more like programming than simple querying. So we'll just have a little look at what some of these queries can look like in MongoDB, which I think I've got up here. What we have here is a Jupyter notebook in which I've written some Python; some of this is very similar to what we showed in the MongoDB webinar a few months ago. All I'm going to do is quickly run these. The first one just sets up the connection into MongoDB, and I've told it what collection I'm going to use: a collection called Brexit, which is actually a series of tweets I collected just before the referendum in June. If you want to know what a tweet looks like, I can say show me one from this collection, and as you can see, a tweet contains an awful lot of probably quite useless information, but somewhere in there we have the text of what someone was actually saying, and various other things. How many tweets did I collect? 55,547. What if I want to count how many tweets there were where the geo field was not equal to none? The geo field holds things like GPS information, longitude and latitude, i.e.
where the tweet was sent from, and if I run that, I've only got 53 out of my 55,000. That's about 0.1%, which is apparently typical for the proportion of tweets which have geo information in them. Having got that far, within these 53 tweets I'm going to look for tweets where the location of the user is a place called Arrasy, which is about 30 miles north of London, and instead of getting the full tweet back, as I did up here, I'm saying I only want you to show me the ID of the user, the location of the user and the geo information. If I run that, what I get back, and again this is very JSON-like in structure, which is handy because Python can handle JSON very easily, is the coordinates, so I know where this tweet was sent from (I already knew the location was all going to be the same, obviously), and I've also got the ID of the particular user who sent it. So I can do the same sort of querying-type analysis in Python as I was able to do in SQL, but it's a little more involved, and the query language itself, while probably just as powerful, is a bit of a learning curve, perhaps. The last thing I wanted to show you is the graph database. For this we're going to use Neo4j, which again, if you go to the website I've given you the link for, you can download and install on your PC quite happily, and what you get with it is a simple movies database; that's what I'm going to use for these examples. It also comes with quite a few tutorials on how you can load your own databases and get information from them. So let's have a quick look at Neo4j; let me just make this big. When you load up Neo4j, this is pretty well the front screen that you get. If I click on the database there, on this little asterisk, it will immediately run the default query, which matches any node and returns
up to 25 of them. What it's doing in the background: the data is organised into things called nodes and relationships, with properties attached to the nodes. So what I'm getting with this simple query, and remember this is a movies database, is a film here, which is a node, and then actors, or persons if you like, which are also nodes, and the relationship between the two is written in here: acted in, most of these are acted in, I think. Down here we've got a more complex one with a few films in it, and you can use this to see the relationships, so this person here, whom I've never heard of, is clearly a director, I suppose, and this person seems to have acted in these three films, and so on and so forth. It's a nice graphical interface, but it's very easy to get it so crowded that you can't actually see anything; if I change that to 250 up here and run it, then that's just a little bit too much to be of any use. But you're not restricted to just using the graph like that; you can run other queries as well. Let me just bring it up again. I've got a few other queries here, and I'll just run a couple of them. You can sort of see from this one what we're looking for: we're finding a person, which we're calling people (that's just a variable name), that is related to the movie Cloud Atlas, and we want to know what the relationship is. If I run that, what I get for Cloud Atlas is everything for Cloud Atlas: the multiple directors, the producers, the writers, and so on and so forth. You can have this as a table of information as well if you like, and if you really want, you can have it returned to you in JSON, no surprise there, really. And again, Neo4j and MySQL both have interfaces to Python and R, so you can do all of this entirely
programmatically if you want to. I've got one last query to run, because it demonstrates why graph databases are considered so powerful. What I'm looking for here is any connection between Kevin Bacon and Meg Ryan, who are both persons. This little asterisk in here says I don't care how many hops you've got to go through, how many different nodes you have to visit; find me a connection between Kevin Bacon and Meg Ryan. This is all wrapped up in a single function here called shortestPath, so it's only going to return one result; there may be multiple connections, but it's only going to find the one which is the shortest path. If I run that, I get a little graph, because that's easier to read, and you can see that the connection is that Meg Ryan acted in Joe Versus the Volcano, Tom Hanks also acted in that, Tom Hanks also acted in Apollo 13, and Kevin Bacon acted in Apollo 13. So you can use this to find connections between different people, and depending on the depth, how many links you might have to go through, that becomes incredibly inefficient in a relational database; this is where graph databases are so very quick and useful. So, in summary, what we have is that the size of your data may be enough to make you decide on using a desktop database: it's in that middle ground between 'I can't use my normal Stata or whatever' and 'I don't really want to go into a big data environment or a server environment, so I'll try to put it on my desktop'. But size is not the only thing you need to consider. You want to know how you're collecting data over time, how it's going to build up and whether it's going to get big enough. You want to know the structure of the data, because some data is going to be more suited to a relational database, and some data is going to be more suited to a document-type database like MongoDB if you're not sure of the structure, if it's
semi-structured data. And how do you intend to use the data? What I'm getting at here is this: consider the ways of collecting Twitter data in the likes of R. R has a package for collecting Twitter data, but what it does when it collects the data, instead of giving you that long list of items which I showed you in the Python and MongoDB demonstration, is automatically select a set of fields which it is going to save, and it gives you the option of saving them into a relational database. Because it has picked the fields in advance, there's no problem with storing them in a relational table, as it already knows what those fields are going to be. That's fine if you know you're just going to use those fields, but if, on the other hand, you might want to go back and use different fields, then you may have to consider storing the whole of the tweet, the raw data, and that may influence how you do it. So be careful about whether you can clean and strip the data as you collect it. That's what I've just been talking about: effectively, this is what the R Twitter package does; it decides for you what it's going to save, clean and keep as it collects. If you can't decide that in advance, you may have to keep all of the data, just in case. So again, in our tweet example, that would be the full tweet, and again there are packages in Python which will record the data and store it straight into MongoDB, and it will store the full tweet data, and then you can process it in the way that we showed. Okay, so those are things to consider before you choose your database, not only deciding whether you want to use a database at all. Okay, any questions? Thank you everyone for attending, I hope you found it useful and interesting, and have a good afternoon everyone. Bye. Thank you. Goodbye.
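The MongoDB queries shown in the demo, counting tweets that carry a geo field and then projecting a few fields for users in one location, can be sketched in plain Python. The documents below are invented, the field layout only loosely follows Twitter's JSON, and the MongoDB operators mentioned in the comments are indicative rather than exact.

```python
import json

# Tiny invented list of tweet-like documents, parsed from JSON text
tweets = json.loads("""
[
  {"id": 1, "text": "first tweet",  "geo": null,
   "user": {"id": 10, "location": "Manchester"}},
  {"id": 2, "text": "second tweet",
   "geo": {"coordinates": [51.75, -0.34]},
   "user": {"id": 11, "location": "St Albans"}},
  {"id": 3, "text": "third tweet",  "geo": null,
   "user": {"id": 12, "location": "London"}}
]
""")

# Roughly a count with {"geo": {"$ne": None}}: keep tweets with GPS info
with_geo = [t for t in tweets if t.get("geo") is not None]
print(len(with_geo))  # 1

# Roughly a find with a filter on a nested field plus a projection:
# keep only the user ID and the geo information
projected = [{"user_id": t["user"]["id"], "geo": t["geo"]}
             for t in with_geo if t["user"]["location"] == "St Albans"]
print(projected)
```

The difference in practice is that MongoDB evaluates the filter and projection inside the database, so only the matching, trimmed-down documents ever reach your Python session.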