 Hello everyone and welcome to this Eucadata Service webinar, introducing the Open Database Connector. I'm Margrita, I'm a Senior Communications Officer working for the Eucadata Service and presenting today is Peter Smyth, he's a Research Associate working for the Eucadata Service at the University of Manchester. Thank you all for coming and let's start with what we are discussing today, which is what is ODBC? So what we're going to show and look at in this webinar is the definition and background of ODBC, an explanation of why we would want to use it, because it's not always necessary to use ODBC to connect to a database, but sometimes it has some advantages and that's what we're going to try and cover today. I'll explain what ODBC drivers are available because each database tends to have its own ODBC driver and you've got to make sure there's one available and you've installed the right one for your database. And then I'll go through the process of installing an ODBC driver. They tend to be slightly different for each ODBC driver, but there's enough commonality in the process for once you've seen one you understand what's going to be needed for any other one that you may need to install. And then we'll go through some examples of using ODBC. We've got examples in Python and R and in Excel. So let's start with the definition. What is ODBC? Well, it stands for Open Database Connectivity or probably more commonly referred to as Open Database Connector. It was developed by Microsoft in the early 1990s and effectively it's an API, an application programming interface for accessing database systems. We had a webinar on APIs a couple of months ago, so hopefully some of you will understand what an API is. But essentially what it is, it allows a common interface, a common set of commands which can be given to an application that the application understands and can respond again in a well-defined way and give that response back to the person who made the request. And this makes it very easy to provide interfaces to applications even though kind of scenes application may change over time. The API is often referred to as drivers, which I think I've already done once or twice, and certainly it's a way I would normally refer to as an ODBC connector as an ODBC driver. Oops, overshot there. Let's go back one. Each database system will have its own API and hence its own ODBC driver. So an ODBC driver is associated with a database system, not with the application that's going to make use of that ODBC driver. For Java-based applications, which are typically non-Windows applications if you're using Linux or whatever, there's also something called JDBC which is Java Database Connector and that serves a very similar purpose to ODBC. They are sufficiently similar for us not to need to consider them separately and we'll just be looking at ODBC today. So why would we want to use ODBC? Let's start this by considering the situation where you don't really need to use ODBC at all. If you've used Microsoft Access as a database part of the Microsoft Office package, what you get is a single database system package which contains two parts. You get a database engine which is used to store your data in and actually process the query to retrieve the data from the database, but you also get a GUI, Graphical User Interface, and this is the part which the user, the end user actually sees when they're trying to make use of Microsoft Access. This allows the user to interact with the database engine by, it gives you query answers where you can write your queries in a very graphical way if you want to, and it will also present the results in nice tabular format and allows you to create nice reports which can be generated and used offline. But essentially what we've got here is that there are two parts of this package all in one, all installed at the same time. The database engine and the GUI. So in pictures what we effectively have for Microsoft Access is the GUI and the database engine, and it all comes in the package which installs Microsoft Access and then both the GUI and the database engine are there for you. If you look at some of the bigger databases such as Oracle, Microsoft SQL Server and MySQL, they all have database engines and GUIs just like Microsoft Access does. The difference here is that they're essentially separate products. So what this means is you can store the database engine without installing the GUI and a single GUI installation can give you access to many database engines. Now in practice in our desktop environment that's not a very likely scenario, but for large organizations running multiple databases it can be very useful to just have a single GUI which controls all of the databases. So you can see the advantages of having the GUI actually as a separate item to the database engine itself. If you did install them both then effectively you end up with a very similar to Microsoft Access environment that we looked at a minute ago. The difference here is of course you've got bigger databases and there's more functionality involved. So what we've seen for Microsoft Access is that and for the larger databases what we effectively have is the GUI separated but connected to the database engine. There are separate applications but they're very tightly coupled. Again we're not using ODBC here, that's just the way you would normally install and use these database systems. So the point being that you can install the database engine without installing the GUI. So instead of this situation what you could effectively do is say I don't want the GUI and this is where ODBC comes in. So we now have some database and as I've said before this could be SQL Server, Oracle, MySQL and many many other ones. Almost all of the large database systems will allow you to connect by ODBC. And what we're going to do is get an ODBC connector which is specific to that particular database that we're going to use. And then from the ODBC connector we are going to be able to connect a whole variety of different products and applications. So for example we can use R to connect to ODBC and onto the database or we can use Python which all goes programming environments. Alternatively you can use Excel or things like SPS and Stata and what have you. So just get through that in words. You can have ODBC connections which can be used by many third party applications such as Excel, Stata, SPSS, MS Access. This is a different set to what I've just shown you on the picture but the point is there are many many more which can be used in this way. Equally you can use ODBC connections from programming environments like R and Python as we've already said. A slight complication here is that when you're using ODBC's drivers with these specific packages or these languages you typically need an additional package or library in order to interface the ODBC itself. These packages or libraries are essentially these mini APIs associated with the language themselves. So in Python you have the Pi ODBC package and in R you've got the R ODBC library and these are easily installed. If you have these languages installed on your desktop it's very easy to install these additional packages and libraries to make it to allow you to use ODBC. The next thing we want to do is look at how do we install ODBC drivers. What we're going to do is I'm going to go through a specific example of this but before we do that it should be pointed out that some of them are automatically installed with the software and in particular the Microsoft products like Excel and Access. When you install those on your PC the ODBC drivers associated with them are automatically installed. Notice as in Excel there Excel you can actually use both as a client application to access some database or you can actually use Excel as a sort of database and access from other sources as well. So that's on both sides of the coin and other Microsoft products do that as well. Microsoft SQL Server can be used to access other databases via ODBC but principally it's used to provide its own ODBC driver so other applications can access it. For those which aren't installed you need to download the code from a database vendor or third party site. Let's say download the code. Basically the ODBC driver is a program which you install on your PC. You can go to vendors like MySQL to get MySQL ODBC connector or you can give the Hive connector. Hive is the relational database associated with Hadoop which is a big data ecosystem which again we've had a few webinars on in the past and that's what we're going to demonstrate today when we do the demonstrations. If you're installing ODBC drivers you will typically need admin rights on your PC or laptop in order to install it. But once you have installed it you only have to install it once and it will be available to all of the users of that machine. So actually going through the process of installing the ODBC driver we're going to do this for the MySQL one just as an example. So you're going to go to the site which is this one here which is I think sort of mentioned at the end and for the information slide. And you download the correct version for your applications and architecture. So for example you might have a Windows machine and you might have a 64 bit application that you're going to use it with. Be careful, a modern Windows system will almost certainly be 64 bit Windows. What you're really interested in is how many bits your application is. Now that may not be something that you've ever considered before so I'm just a little bit of a warning here is that many Microsoft Office installations are installed as 32 bit applications and are in Python if you've installed them yourself you would normally go for the 64 bit installations. But either of those could be swapped the other way around so you really need to know which one it is that you're going for or what one you need. When you go to the MySQL ODBC driver site this is the sort of screen that you're going to get presented with. You select the platform Microsoft Windows and then you've got the option of picking whether you want the 32 bit installer or the 64 bit installer. You can actually download these as zip files if you want to but I would recommend you use the MSI installer because then when it's downloaded you just double click on it and it will run a little wizard to install the ODBC driver for you. What I said before about making sure you get the right one 32 bit or the 64 bit in fact there's nothing to stop you downloading and installing both 32 and 64 bit if you want to. But you do have to configure them separately depending on your application that you're going to use. So let's consider configuring these ODBC drivers now. Once you've installed you need to install a little bit of configuration and the reason for this is that the ODBC driver has to know things like who it is, who you are, what database you're going to talk to and so on. So it varies slightly from one driver and database to another but they're all basically the same sorts of information that you need to give it. And this is things like your username and password because you're going to access a database which is typically going to require you to authenticate to that database. So that explains them. You need to know the location of the database. If you remember back to that little diagram where we had the GUI and the database engine and the whole point is that we've now separated the two. We've stuck ODBC in the middle. Because the two are separated, the actual database engine doesn't actually have to be sitting on your PC anymore. It could be remotely connected to it or you may need to remotely connect to it. So you need to know where this database system is. It could be your PC, it could be some remote location. And again, because we're typically talking about the larger database systems unlike Microsoft Access where the database system itself can actually contain many, many databases, then you need to tell it what database within that system you're actually going to try and access. So what we have next is this is the example of a high ODBC driver that the window that you get when you say you want to configure the driver. And there's only a few, there's lots of things in there, but only a few of them cover things that we need to provide. So typically what we need to do is give it the host addresses, the database system you're trying to access. And you can see here in this example, all I've actually given it is the IP address of the machine on which this Hive database is going to be running. So that's that part. We then have to say what database we're interested in within this Hive installation. And the default database for Hive is a database called default, which is what I've left out of two because that's where all my tables are. But I still need to authenticate and so I still need to provide it with a user name and a password. But without that done, everything else I can leave the same. If you look, that would be the Hive ODBC driver. Looking at the MySQL one, it's actually very similar, a bit cut down. But again, bits of information that you need to give it are root password, a bit called user, which is root in this case, the password for root, the root user, and again the database that you're interested in using. If you may have noticed that on both of these, both of these, there is a little test button just to the right of the bit I've covered in red there. And if you click on the test button, it will actually make a connection to the database just to check for you that you've given all the correct information. It's a very useful thing to do so you make sure you do actually have a legitimate connection into that database system. So having set up your ODBC driver and configured it correctly, we can then think about using it. So what we're going to look at are some examples, and the examples I'm going to show you are, we're going to use Hive as a principal database. So we're going to connect, we're going to connect Hive to Python using Python, and then we're going to use Hive data to R, and then we're going to do it using Excel, which is a third party package. So we're going to leave the presentation for the time being and start looking at some of these packages. First one is Python. Now for Python, I've got a few little strips set up here. I'm using Jupyter notebooks as I've done on previous webinars. Some of the information I'm going to just run through quite quickly because a lot of it is just Python rather than talking about ODBC. But in this first line here, this import statement tells the system what packages I'm interested in, and here I've got my PyODBC. If I run that, it just runs, I don't get any response to it. Now that I've got this package in PyODBC, I can make calls like PyODBC.connect. This is going to make a connection to an ODBC driver, and the ODBC driver I'm interested in is this, Sample Hortonworks Hive DSM. If you run from a couple of screens back when I was configuring the Hive ODBC, that was the name that it had been given. So when I run that, I get a connection, hopefully, and then in this next little paragraph, I can actually start using this connection. And what I'm going to do here is I'm effectively, I won't go through all the Python, but effectively I'm going to run this little query, select star from the GeoAll table in the Hive database. That's about as basic an SQL type command you can get. It just says return all of the rows from that table called GeoAll. And what it's going to do is, in fact, that's what the command is, but then I'm just going to say I only want the first one, which is why I've got this fetch one. And then if that was successful, if it successfully selected a row from that table, I'm going to print the row. So when I run this, what I should see is the ODBC, or the Python program is going to collect ODBC, ODBC connects to the Hive database, it runs the query, selects star from GeoAll on the database. The database returns the results back down into Python, and then I can print what the result was. So if I run that, and I wait a second or two, I can see I've got this as the first row from the GeoAll table. It's got lots of different columns in there, won't go through them all, but that is effectively one row of that table coming back. So now I've actually got the data back into the Python environment. If I wanted to be more ambitious on the second one, again I've got the same select statement, but here I'm going to fetch all of the rows instead of just the first one. And here I'm going to print what is effectively the index number seven from each row being returned. And index number seven is actually this text line, a bit of text up here. So if I run that, it's going to go to the database, go via ODBC to the database, and return all of those lines, all of those values in that column there, and you can see them down there. As there's more than 14,000 of them, we don't really want to go through them all, but the point is I could have put them into other processing and offer whatever it was I needed to do. So just clear that. This next one is really just to illustrate how sort of flexibility system is. You don't have to have all of your query in a single like-like-like. You can set it up as two parts of a variable and you just append them together, and instead of using the cursor execute with actually a string, you can just put the variable name in there, and then it runs the same thing. And here what I'm doing here, I'm counting how many rows that are in the electricity table, which is again a table in high environments on the Hadoop system. If I run that, it comes back into over 3 million rows, which is just counted. Python hasn't counted the high-counted rows and just returned the answer to Python, and all I've done here in the bottom is to show that it came back in about one second. So it's quite fast in processing the queries. And then on this last one, what we've done with that so far is we've just done very basic SQL type statements. But in fact, because we're talking to Hive, we can actually send it any command that Hive understands. And one of the things that Hive understands is this is a command called show tables, which you can probably imagine is going to show me all of the tables that are in that database. So if I run that, then I just get a list of all of the tables available into the database. So for example, I've got a table called DT9. That's all I've got there. So that's using ODBC connector and Python together to access the Hive database. But now I move on to R. In R, again, I'm using the standard R Studio environment, which we've used before on other webinars. I'm just going to run these first few lines up here to set the scene, say what libres I want. Notice that this top library, RODBC, this is the one we're really interested in on this occasion, because that's going to allow us to connect to ODBC. And hopefully, I think that's okay. Then, I haven't done that. I've got a command here, which is very similar to what we saw on Python. The function name is different. It's called ODBC Connect here. But again, all I'm giving it is the name of this Hive ODBC connector I want to make use of. For example, Hornworks Hive DSM. And I'm sending that to a variable name there. So if I just select that line there and run it, and that comes back in a mix, you can tell from this line up here that that's made a connection. And again, the only channel is just a variable name. I could have called it x, y, or z. If I wanted to, I've just happened to call it channel. And from that point, in the same way as we did in Python, I can run another function from our ODBC library called sqlquery, which has quite a meaningful name. I'm going to run an sqlquery. I'm going to use the channel that I've just set up, i.e., the ODBC connector I've just made a connection to. And in here, I'm just going to do a very simple select statement in it written in sql. And if I run that, the results of that select statement, i.e., all of the rows from that ELECT09 job table, are going to come back and they're going to be written into this effect that is going to be a data frame called DF1. So if I run that, and you can see up here, DF1, it's now returned, and I've got 143,290 observations of 13 variables. If I wanted to have a look at them, I can do this and just see the various values for those rows in there. So the point is, having got this far, I've now got the data back from the Hive database, and it's now sitting in a standard R data frame, and from that I can do anything I like to it. So I've just got down here a little bit of R script to draw a couple of graphs. I'm just going to run that all together, just to show that what I've really got is genuine data from Hive, and when it finishes running, I've got a couple of nice graphs. I mean, I've just chosen to do a couple of graphs. Once you've got the data back in the data frame in R, you can do anything you like to it, as you would do with any kind of R analysis that you're interested in doing. Just a bit of variation. I've got here using something very simple and very similar, but here I'm using the MySQL ODBC driver, which I showed you downloading before, and again, it's exactly the same. I've just got the library R ODBC, which I won't need to run because I've already run it. I've changed this very name just to make it distinct, but again, the ODBC Connect statement is exactly the same as we had before, only this time I'm saying MySQL. In here, I'm just going to run a very simple query to return data from a table called City, and I'm going to put this into DF2. When I run that, it runs exactly the same, and you can see here I've got 4,079 variables. The other thing I just wanted to point out when we're doing this, so far this is very simple select statements and SQL statements and the show tables, which I did as well in Python, but you're not actually restricted in how complex these SQL statements can be. In fact, it can be anything which Hive, in this case, because we're using Hive Connector, or anything which Hive accepts as valid SQL or HQL, you can put into this query. So here, what I'm doing here, TableDT9, which we saw before, so I'm not going to run this because it already exists, but here we're creating that TableDT9, and it's selecting various columns from two different tables. I've got to join on here. I'm joining GeoGal in recommends, and I've got substrings and wareclothes and all sorts of things. There's no real limit on how complicated this query can be. The advantage of this is, of course, that you can actually, from your R environment, you can have Hive do all the heavy lifting work by doing your aggregations and joins and what have you, and then just have the results of all that returned back into your R environment, and then off you go and you can run your queries and draw your graphs and whatever you like. So there's no real restriction here on what you can actually put into this SQL query function, any kind of valid HQL. So that is using R. And finally, we said we'd use Excel. So I've got Excel open here. This is an empty workbook in Excel. If I get to data, and I can say from other sources and from Microsoft query, I should say that depending on what version of Excel you've got, these might be in slightly different positions. But Microsoft query would definitely be there. And from when you click on that, you get open data source and you can see down here, Sample Hive DSN, which is my ODBC connector for Hive. So I'm going to say OK. As soon as I click OK, Excel itself will make the connection to the database, the Hive database, and it will actually get me a list of all of the tables on that Hive database. So again, you can see DT9 is still up there. So what I'm going to do is I'm going to go for something called ELEC months. ELEC months. I'm going to say I want all of the columns. There's four columns in there. If you want to know what columns are, you can expand that and it will come back and tell you. So probably shouldn't have done that. OK. If you click on next, you can do things like filtering. We're not going to bother with any of this, but if you wanted to, you could do. We're going to say you can sort them by things if you want to, but we won't bother. All we're going to interest you is getting this data back into Microsoft Excel. So we click finish. It comes back and say, where do you want to put the data? Well, top left is fine. We just can say OK. And then we wait a few seconds. This little message up here is getting the data from me. Oh, look, it's returned all of the data. And if I go to the end of that table, you can see, well, pretty much 300,000 records it's returned. OK. And of course, once you've got it in Excel, you can do whatever you might want to do with the data, draw graphs or do further processing or whatever. Now, I just wanted to emphasize that. What I was saying before about the complexity of some of these queries, if I just go through this process again and it doesn't really matter at this point what I'm doing. I'm just going to pick this cordy to a record. There's no idea what that is, but there's lots of things in it. I'll just go through this again. But at the end here, instead of just saying return the data Microsoft Excel, I'm going to say view the data or edit the query in Microsoft. Edit query Microsoft query. And I'll just click on finish. Instead of putting the data straight into the Excel worksheet, I actually get Microsoft query come up and it actually shows me what the data is down here. But what is interesting, if I click on the SQL button here, I can see the SQL statement which it sent via the ODBC connector to Hive to get this data back. Now, I can wipe all of that out and write something complicated in there like my show tables or any other valid SQL statement. If I say OK, Microsoft query gets a little bit upset because you can't draw a picture of that but we don't really care about that because then what I get back is my list of tables as I had before and I can just return that into Excel in some way. So I can again write a query to suit what I want rather than just do simple selects and filtering from tables. So again, I've just done a show tables there but again I could have had to create tables or done complex aggregations in the Hive environment and then only have the results returned to Excel. So it's a way of getting the heavy work done on the Hive system, which is on a server somewhere and then you just get the aggregated results back into your application or your programming environment and from there you're in a comfortable environment and you can process that data as you may choose. So that is it for the demonstrations. So back to the PowerPoint slides. And a little bit more information where you can get these ODBC drivers for. I've given you the one for the Hive ODBC and the MySQL one which we use. Basically if you've got any other database like Postgres or anything like that, if you just Google the name of the database followed by ODBC, you'll undoubtedly get a link of where you can download the suitable driver from. And with that we have any questions? Thank you Peter for presenting today and thank you to everyone for attending. Hope you found it interesting and useful.