 I want to say databases in SQL or SQL databases, a few little things we want to look at. We're going to discuss what are database A's for those who haven't used them before. There are different types of databases, increasingly so these days. I want to answer the question, when the desktop isn't enough? This is a common question I'm asked by people who say, well, why do I need to use databases? Why can't I just use Python or R to do this kind of working? And then we're going to have a look at SQL itself. Just a very simple introduction to SQL, only a tiny little part of what SQL can actually do. But it's probably by far the most important part, especially from your point of view of trying to analyze data from within the database. Then we're going to sort of, the last two points are really by way of demonstrations. We look at how you can actually access a database in various different ways. And starting with a GUI type interface, and then we'll look at some of the programming languages that you can use or different ways of accessing the database itself. So what is a database? Any collection of data, you can call it a database if you like. So here's a collection of data, but probably you wouldn't really think of that as a database. One of the things about a database is it has to be structured in some way so as to facilitate the retrieval of the data in it. So you've got to be careful how you put the data in to make it easy to take the data out. And in most cases, you won't want all of the data that's in the database. You might only want a very small part of it. There are different types of databases, but they all do essentially the same thing. They're all there to store data in such a way as it makes it easy to retrieve them. Not all of the databases or database types are going to store the data in the same way as each other, but there are a lot of the time you don't really care about that. You're only interested in being able to retrieve the data in the database. There are, well, sometimes we don't care, most times we don't care, but there are some things which it can affect, it can affect the efficiency of the queries that we use to access the data, and in some ways it can affect how we write the queries. But certainly for what we're looking at this afternoon, we're not interested in anything. They're not going to be relevant to what we're doing. The different types of databases, they're broadly split into SQL or relational databases and no SQL databases, that shouldn't be there, I don't think. We're going to focus on the SQL databases and where the data is actually stored in what we can think of as tables, tables which have rows and columns in them, and the relational part of the relational database is how you connect two tables together using some common key field. And unsurprisingly, SQL or relational databases use SQL queries to retrieve the data, and that's what we're going to be focusing on today. But just to put it into context, there are other types of databases. These are normally referred to as no SQL databases, and here you're still storing data, but the data is stored in terms of collections of documents. Now, if you look at the little table at the bottom, an SQL table equates to no SQL collection, and an SQL table row is a no SQL document. So what is it that makes a document different from a table row? And that is the complexity of the structure of the data. If we have a table, it's simple rows and columns, whereas in a no SQL document, the structure of the data can be a lot more complex. And as we're not going to be talking about them, I will go into explaining what those complexities are. But basically, complex is complex, simple means tables with rows and columns in this context. Different documents in the same collection can also have different structures from the one next door to it, if you like. Whereas again, in the relational database, all of the rows of the table have to have the same structure. And generally speaking, we don't use SQL queries to access the data. There are some modern systems which allow you to use SQL to access no SQL databases as well. Which is another reason why it's good to know SQL. So tables, I probably already mentioned this, data in a relational database are ancient tables. Think of these very much like the format you might see in a spreadsheet, rows and columns. Each column has its own column name. Internally, a database table is not quite the same as a spreadsheet, but as an image of what the table is like, if you want to imagine what it's like, it's a very, very good image. Here's just a picture of the image, an Excel spreadsheet showing the data that we are going to look at later on in our database. And you can see column names, columns, and rows of data within the columns. Why relational database? You can actually create a database which only has one table in it. And there may be use cases for that. But generally speaking, a database will have multiple tables in it, and there will be some way of connecting these tables together using these keys and relationships between the keys. As I say, it's possible that they could be totally independent, but that would be a very unusual use case. So this is an example of a paper sales receipt, which I found on the internet. There's various little sections of this. Do you imagine this would be commonly used pre-computer days in department stores where someone was buying something so you needed information about the customer, customer order number, name, address, and so on? And information about how they paid for it. It's like cash or cash on delivery, charge, on account. I can't really give it to you. And then the main body of this form is what they bought, how many they bought, how much an amateur would have to pay for it, or paying for those items, and then at the end we go to total. I don't know, received by. So that gives you all of the information related to a particular sale in the department store. If we wanted to store this in a database, we would typically break it down into several tables. Possibly more than I'm actually showing here, but just as an example, we would have a table which contains customer details, names and addresses of the customers. Order details, so this is what they actually ordered, how many and so on, and the amount. The payment details, how they paid for it, and the accounts details over how much they had to pay. And you can imagine these tables being used by different departments within the organization for different purposes. So the customer details up here, they could be used for sending up, where there's typically just a name and address of the customers, if you want to send them that your latest catalog. This is all of the information that you really need named and addresses. The order details down here, well, who uses the order details? What you really want to know from this, or what this is really telling you, is how much stock you have sold. So the people in the warehouse who have to reorder the stock are gonna be very interested about what's coming up here and the quantities and the descriptions, because they've got to reorder the stock into the store. Payment details, well, the account department want to know if it was cash, because you've got to balance the cash at the end of the day, cash on delivery, if it's gonna be coming later, and so on and so forth. And again, accounts are gonna be interested in total of the sales, how much money is coming into the store. Now, so they all have their own little uses, but individually set up like that, they are a bit, well, they are limited use. What we really need to be able to do is to relate a particular customer with what they ordered, how they wanted to deliver it, and how much money did the store get. So this is where the relationships come in. What we need to do is create relationships between these three tables here and link it back into the customer details table. You possibly can't read that on the screen, but it says customer order number. So that's gonna be a unique number there, and that number will actually also appear in each of these three tables down here, at the top, I haven't shown it on the diagram because I just got this from the previous slide, but each one of them will have the order number in there. So for any one of these, we can relate it back to the specific order and the customer number. I mentioned that each of these have their own uses for different departments or different purposes depending what you want to do. The point about it being that if you just want to send someone a new catalog, you don't need any of these tables. If you're just interested in reordering stock, you just need the details or the details table. Yeah, so by splitting the data up into separate tables, you can limit how much data you need to process to get the information you need. And that is one of the key design elements of relational databases. You only store data once by breaking it up into tables, you're hopefully gonna reduce the amount of data you need to process. Doesn't always work because one of the key things about a relational database are these relationships themselves, which allow us, if we need to, to join two tables or three tables together to get a larger set of information. So advantages of using tables, small amount of data in the manipulator, I've just said, and the security of the data. What I didn't make clear on here, having split these out into separate tables, a modern relational database would allow you to restrict who can see each of these tables. So for example, the people in publicity who are sending out the catalogs, but they have no need to know how much money a customer spent. So you can hide that from them or stop them seeing it. Disadvantages is that it's slower to update because whenever you go back again to our original paper sales release, all of the information was put onto that single sheet of paper. Whereas here, I've now got to add bits of data into four separate tables. So it's easy to imagine how that's gonna be slower to update. And I've already mentioned that more complex queries may need data from more than one table, which adds to the complexity of the query. What is a desktop not enough? What you need to remember is that databases can be very, very large, gigabytes, millions of records quite easily. And potentially there can be too big to store on your desktop. Your desktop has a relatively small machine. Yes, it's okay for browsing the internet and what have you, but storing large databases is not really what it's designed to do. Another thing is that the data may not be yours anyway. It may be someone else's database who has allowed you access to it. You may only be interested in part of it, i.e. some of the tables or just part of some of the tables. Fortunately, we can make it easy to share databases. But we'll need to look at how the environment is constructed first. So a database system, I'll put system in quotes because it's optional if you use a word system. It's typically, it's not a single program like Excel is. The two main parts, you get the database engine itself and this defines the database type. So by type there, not only whether it's an SQL or no SQL database, but potentially also whether it's an Oracle database or a MySQL database because they will each have their own way of actually storing the data on the disk. And they need to know that because they need to retrieve the data. So it stores the data on the disk for you. It needs to interpret and act on SQL requests. That is the queries that you're gonna send to it. And then having interpreted and acted on the request, it's gonna return the data from your queries. All of that is the job of the database engine. The other side of this is, well, how are we gonna communicate with this database engine? And we do this using the user interface. We can communicate with the database engine and we can do this in many different ways. All database systems come with their preferred user interface, usually some kind of GUI interface. But there are other ways of doing it as we will see later on. So imagine Microsoft Access. Now this is in fact very much like Excel in that it comes as a single package. And when you install Microsoft Access, you get a GUI interface and you get a database engine. You can't really install a database engine without the GUI, but you can actually access the database engine without the GUI. It comes as a single package. But some of the larger database systems, SQL Server from Microsoft, Oracle, MySQL, they come with a GUI, yes, and they get a database engine. And you see this just arrow just to show there's a separation between the two. They are two separately installed items. Again, if you're doing this on your desktop, you typically want to install them both, but you don't have to. Install a database engine without installing the GUI. So GUI database engine, we don't need to have the GUI. We can provide some other means of talking to the database engine. Anything that connected database engine you will do, and there's lots of things which will do that. So here's a selection of typical database types, MySQL server, MySQL, Oracle, SQLite, which you will be using later on, Microsoft's SQL server. And the point about this is we may want to talk to any of these databases from, say, R, or Python, or Excel, or SPSS, and lots of other third-party programs as well, programs and applications. All we need to do is have a means of communicating between the two. The standalone GUIs are available. These are only a very small selection. The MySQL workbench, unsurprisingly, is coming from MySQL. DB Bride for SQLite is a separate product that is designed to work with SQLite here. DB there, as far as I know, works all four of those databases, and it's a product which is designed or sells itself, if you like, I'm saying, I can connect to any database that you care to mention. We also have sophisticated text editors like Microsoft's VS Code, and they provide very rudimentary kind of interface where you can tell the queries, but it doesn't have many of the useful benefits that a GUI would have for you, or would value it. So that's all the subject, separate, I don't know the message of my module, that means hopefully I'm going to continue, so okay, oh, it's re-established. Okay, so we're saying that the easy access interface to GUI and the database engine can be separated. They're not the same thing. They don't have to, not only can they be separated, you don't have to have them on the same machine. The GUI could be on your desktop or laptop because that's where you're going to write your SQL queries, and the database engine could be on a far, far larger machine, hundreds of gigabytes of memory, which is obviously going to facilitate very large databases for you to use. So if we look at this in pictures, we have the GUI and the database engine, that was the MS access type environment. We can have the GUI on the desktop, we have the database engine on the server, and the point about this separation of the desktop GUI and the database on the server is that in real size, this is your GUI down here on your desktop, which is just a typical desktop size, but the database engine could be on a very, very large machine, allowing you to process very, very large databases. Of course, the thing is that you're just writing the SQL query on your little GUI, not just text, you send it off to the database engine, processing a database of perhaps hundreds of millions of records, but if your query only returns 100,000 records, then that's all that has to be sent back to the GUI on your desktop. So you can use this for the large storage and processing, and then just return to your GUI on your desktop, the data that you want. And when I say GUI on the desktop, it doesn't have to be the GUI because it could be any of the other interfaces that we saw are a Python and so on. So that's like the background of SQL databases. So now I'm gonna start looking at some SQL itself. Two basic types of SQL commands, DDL statements and DML statements, DDL statements, data definition language, they're used to create databases and the tables within the databases and generally manage the environment. And we're not gonna be terribly interested in them because we're gonna start with a pre-populated database. The data manipulation language, or DML, that's used to deal with actual data in the tables. So we're only interested in the DML statements, I've just said, and really only one of these. There are, in fact, four basic ones, create, which in SQL terms is referred to as insert, read, which is the select statement, which is the one that we are gonna be looking at, update a record, delete a record, or a row from a table. And these are collectively referred to as CRUD statement, C-R-E-U-D. We're gonna go, in demonstration, I'm gonna start in a minute or two, we're just gonna assume that we have some data in our database. I'll show you how we got it in there when I show you around the interface. But so all we're really interested in is the select statement, reading data from our database or our tables in the database. This may seem very limiting, but the reality is that most of your time on database work will be writing select queries and getting data out of the database. After all, you create the database only once, you put the data in there only once, but you can query it as many times as you like. So what can we do with a select statement or select query? We can take specific columns from a table, tables. We can ask for specific rows from a table. We can add new columns to the output of a query. I'll come back to that in a minute. And that output can be based on existing columns or on a variety of built-in functions. We can create aggregations from the rows in a table by grouping the values in one or more columns and we can produce sorted output based on column values and so on. So you see a lot of these select specific columns, i.e. not all of the columns, specific rows, i.e. not all of the rows, create aggregations, summing things together. These are all things which are actually gonna reduce the size, the amount of data being returned to you. What we need to know, a couple of things, nothing in a select statement can affect the underlying data in the tables. So in that previous slide, the reason I've put that, whoops, the reason I've put that in bold is when we're adding new columns, it only affects the output of the query. It doesn't affect the underlying data in the tables. The output of a select statement is always, excuse me, is always a table, even if it's only one row and one column. One of the very common things you query on a table is how many rows are there in this table. The answer is gonna be just a single value, but it's still a table with one row and one column in it. Various ways of saving the output from a select query. You can either save it within the database as another table or what's called a view. We can think of these as being quite synonymous, though in turn, either very different. A view, again, just looks like a table with rows and columns. Or we can write it to an output file, typically CSV, because that naturally leads itself to rows and columns. Right, the demo. Rest of the webinar is devoted to the demonstrations. We can look at a GUI, and this is just a GUI. I'm gonna use DB Brands as first to your like, because that's what our database is. But there's lots of things about it which are very common and I'll point them out, common to other GUIs. And then we're gonna create and run some select queries. The dataset we're gonna use to do these on, the dataset I've put into our database comes from the UK Data Service. It is SN7613, the census microdata. And I picked this one because it's not, well, I basically picked it because it's got over half a million rows in it, which is a decent size to be working with a demonstration database. Okay, so I need to just come out to that and go into DB browser. This is an application that you can download. It's freely available. You can install it on your own machine. And it's called DB browser for SQLite. It's designed to allow you to manipulate SQLite databases. And when you load it up, it looks something like this. I've tailored this slightly for my usage, but essentially you have these little tabs across here. And the one regular it's spent all our time in is this execute SQL, which allows me to write an SQL statement in here. The results will appear in this middle panel down here and some messages at the bottom. I'm just gonna make this change the size of it. The first thing I'm gonna want to do though is to open the database. And an SQLite database is in fact just a file with a prefix of typically DB or SQLite. So it's a single file and just click open because I've created this one previously. And when it opens it up, you can see on the left-hand side here under DB schema, it's got five tables in it. One, two, three, four, five, and not much else. Okay, if I want to see what's in any of those tables, I can go to the browse data and have a look at the table contents. And it'll show me what's in the table. Very numeric data, this particular example. All of these things showing what is in the tables, list of the tables, the column names, what have you, being able to browse the data easily. These are things which are very common to all of the database queues. The audience slightly differently obviously, but these are things which are always going to be there. And also this execute SQL allowing you to write your own queries. In addition to writing your own queries, you can actually also load a file of pre-written queries, which I'm gonna do here, select SQL. And in here, I've got just a little file of SQL queries and these are what we're gonna run, some of them anyway. So this is our first little image, first little seeing of SQL. So I'll just go through the first few quite carefully and then we'll look at some of the more complex ideas. So the first thing I said you could do is we can select columns from a table. I think this table here, CMT11 mod is a table we're gonna use most of the time. The star here means I want all of the columns. Okay, so it's not really making it smaller here. The green atop here with the two dashes, this is how you indicate a comment in SQL. So two dashes means that anything following that to the end of the line is just comment. So effectively ignore it, it's just for your documentation purposes. This line at the bottom here, limit 10 is, it's not strictly SQL, it's commonly implemented as part of SQL, but it's one of the statements or clauses if you like, which can change between one database system and another. So for example, in my SQL, it would be written as limit 10. If I was using Microsoft SQL server, I would say top 10. And I guess that changes very slightly. But the majority of the SQL that we're going to see, the select and the from is the same regardless of what database system you're gonna be using, SQL database system. The semi-colon at the end is just to delimit the end of that statement. And that's there principally because these files, as you can see, can have multiple statements in them. So if I select my first SQL statement, a little run button up here, select star, select everything from that table, but only give me 10 rows. So in this middle section down here, it's a bit smaller, we'll see all 10 of them. You can see I've got 10 rows coming back and all of the data in that table. What we really said though, we want to be able to just pick selected columns. So here, rather than using the star, I've actually given a list of the column names I want it to return. And now I've still got the limit 10 in there, but now when you've got the three comments I explicitly requested. Just to point out that the star really doesn't mean all of the columns. You can't use it as a wild card as you can do on some systems. So if I try to run that, I get all red lines at the bottom saying you can't do that syntax error. How you write your SQL statements is entirely up to you in the sense that they are free format. Yes, you've got to have your select and your from. You've got to separate the columns with commas and don't forget the semicolon at the end. But other than that, it doesn't matter if they uppercase or lowercase, let's make sure they're, it doesn't matter if I have got multiple rows or I could have written this all on one long row if I wanted to, that doesn't matter. It doesn't, it's not interested in the wide space in the query. One thing I point out about all of these GUIs, they understand the databases that they're meant to be operating on. So what it will do is if I start typing in select, see how I get to select and when I put the T on, it knows it's a select ZIM and it changes the color. And most of the systems will do this a little bit of help to help you with the syntax, it understands the syntax, it knows how to do it and it will color code various things. And if you're looking at this column here, which is a column called sex, the reason it is in green, that greeny color is because I've also got a table called sex up here. And again, because it's not interested in the uppercase or lowercase, it sees it as a table name, which is slightly confusing for us and so it does get that slightly wrong, but other than that, the color coding is there to help you get this syntax correct. So that's how we get specific columns. What if we want specific rows? Well, this introduces the where clause and we say the first part is very much a same select from, but now we're going to ask for specific value in one of the columns, where gender equals female. If I run that. Now here, I didn't, I've removed the limit now, so you can see at the bottom here, that has returned 289,000 records. And you see the genders are all female. The types of expressions you can use, I've got equality there, greater than here, and you can have a whole host of other things. So a more complex example here, where I've got ands and ors to make a more complex expression, and I can use brackets to enforce how they're interpreted. So making sure that these ors go together and not the ands. So it's a way of enforcing precedents of the expressions. As I said, all the usual operators for greater than less than and so on, and you can use and in order to add them together. So here we can make a far more complex query and run that, and now we're, I can run that all correctly. So if we look at these gender, silly course, female, region is less than three, well, I've got ones down there, I have twos later on, strictly less than three, and a range for the age groups. There are other operators you can use which can help you simplify the coding of the queries. So in this example here, I want age group greater than two, strictly greater than two, and less than or equal to six. I can run that and get the answer. But rather than doing that, I can actually say age group between three and six, using the between keyword there. Okay, and there's also an in keyword here where I can say in and then in the brackets, what values I want some doing exactly the same query again, hopefully getting exactly the same answer. So there's other keywords you can use, and they're just really there to help you simplify the way you write the expression in the where clause. We've also got the like keyword which is operating using wildcards now. The percent sign there means any number of characters. So here, if I run this for the region name, I've got northeast, and if I go down towards the bottom of that, I've got northwest as well, okay. One important thing about data, which you may have come across in your other little adventures with data, is that you can't always guarantee things of the right size and people have to put spaces in the data and what have you. So one of the building functions, this is a thing called trim, which allows you to or allows the system to remove any spaces at the beginning or end of a value in a column. So you'll see this trim use quite a lot to help clean the data. In this particular case, it doesn't make any difference because it's nice clean data coming from the UK data service. So that's not gonna make any difference. I'm looking at the length of the region name and then the length of the region name after I have trimmed it. And if I run that, I think they both always come out as 10 because it's nice clean data. But having extraneous spaces can be a bit of a problem with data, so you tend to use that quite a bit. And finally, we can sort the data on any of the column names we want, any of the column names which are mentioned in the select at the top here, that is, and they'll just come out by default in ascending order, but you can change that to descending if you want to. There's an example of using it in descending order. Okay, so the basic things we can do with the select statement, just occurred to me, I haven't done any aggregations there, never mind, basic select statements. So the next thing I want to show you is that this has all been done using this GUI, which is very useful when you're practicing and you want to try out your queries and what have you. But you don't always want to access the database from a GUI because typically the results are just written down in table here. Yes, there's functionality which allows me to save the results for you and I can export it as a CSV file if I want to. But lots of time, we actually want to use the database system to do the heavy work of manipulating the data and then have a return to some other system. So, leaving that, I think just, I'm changing the, I'm just going to show you how we can access our database system from Excel. So this is just standard Excel from Office 365. I'm just going to open a blank workbook. And then here, I'm going to go to data, data and get data from other sources. And in this case, I'm going to use a Microsoft query. Okay, I could use ODBC, but I'm going to use Microsoft query. And what this will do is bring up a list of data sources which are available to me and the one I want to use is this SQLite data source here. I click on okay and then it's going to come up and say, well, okay, so we want SQLite, but what is the database name? So here I can just browse to where my database is much as I had before when I opened it in DB browser. So that's my database. I'm going to open it. I'm going to click on okay. And this is a Microsoft query little bug which I'm going to say okay to there. Just do that. This should come up straight away in fact show me all of the tables. So from this, you can conclude that even at this point or by this time if you like, Excel has spoken to the database and the SQLite database engine and got from it a list of all of the tables. Okay. If I click on any of the tables, I can see the columns I've got in them. If I select on the table, I can say, oh, I want all of the columns. Yeah. I'm just actually going to undo that and use one of the smaller ones like the occupations just because it's a smaller table and say, yes, I want that. And the next step will say, oh, do you want to filter the data? And you can say, well, ID and then you can put conditions in there if you want to. So that's equivalent of writing a where clause. I'm not going to do that. I'm going to say move on and say, do you want to sort it by anything? Well, no, I don't really. I just want to go on to next. And then I get the finish option, return it to the query to Microsoft Excel, view the data or edit the query in Microsoft query. I'm just going to click on this just to show you what that's going to produce. Clicking on the top one would just literally return it to the workbook here. We click finish and then Microsoft query itself comes up. And you can see here now the data from the table which I've selected. But if I go, if I click this SQL button here, you can see the SQL query. So you can see the select and the from clauses. And that is the SQL query that Microsoft Excel is going to send to the SQL like database engine in order to get the data to come back. I'm going to bother changing that, but theory I could just delete this and write a whole new query if I wanted to. So I'm going to cancel out and I'm going to click on the button here, return data and just Excel saying where do you want me to put it? And there is the data coming back from the table. Now in that case, I want to take in all of the data but given that I had the options of creating a query of any type I wanted to, you can still imagine this of writing or reading a database of millions and millions of rows, a query which is just going to aggregate something or select certain fields and rows and just return something a lot smaller which is can be manipulated in Microsoft Excel. If I don't want to use Microsoft Excel, I thought I'd use a programming language or something like that. Or in this case, Python, I'm going to show you. No, it doesn't really matter if you know Python or not. The only things I really want to point out in this little few lines of Python code are these two I've put in red here. The first one is I have to connect to the database and that's exactly the same as what happened when I opened the database file in the GUI, in DB browser. Exactly the same thing as I did when I accessed it from Excel. I had to say, what database are you interested in? And the other one is this variable SQL into which I just put in my standard SQL query. Select star from occupations. The rest of it's just Python code but you can see at the bottom here how the data is returned exactly the same table as I've just shown you in Excel. So there's many, many ways of getting the data back from a database. You use the GUI typically when you're developing so you can see how the query is going to work at all in the first place. You get the benefits of the color coding in the IntelliSense, which you don't get in either of the systems.