Welcome to the introduction to data management for machine learning and AI. In today's session, we will cover querying and exporting data from a relational database. Why would we be interested in doing so? The reason is that most introductory courses in data science or machine learning work with comma-separated value files. Also, many training data repositories like Kaggle distribute data sets in the CSV format. This is perfectly fine when learning to develop machine learning applications, but as soon as you end up in a real-world scenario where you need to extract data from an existing database, you need to be able to deal with relational databases. Of course, this very short lecture cannot fully teach you how to do this. Instead, I will try to give you a short glimpse into how to query a relational database so that when the time comes and you must do this task for real, you hopefully remember which things to read up on. To summarize, the goal of this lecture is to provide you with a quick overview of SQL, the structured query language for relational databases. We can then use SQL queries to aggregate data from multiple tables and export them to a comma-separated value file or an Excel file. Alternatively, you can also feed such query results directly into a pandas data frame. You will probably learn about data frames in other courses. However, keep in mind that the key to mastering SQL is hands-on practice, so exercising is very important here. We will be using real data from the musicbrainz.org database today. MusicBrainz is a community-driven encyclopedia about music artifacts like artists, recordings and albums. The data used by the MusicBrainz website is stored in a relational PostgreSQL database management system.
Of course, using a proper database management system is important to power such a popular website, as potentially thousands of page views need to be served per minute, and the performance and reliability features provided by such a database management system are key towards that goal. We will be using a local copy of the MusicBrainz database running on my own system, which is possible as the data is available for download under open access licenses. As a side note, there is also a web service API available which allows you to query MusicBrainz data without replicating the full database locally. We will not be using this API today, but keep in mind that many web platforms provide such access, which can be very convenient in certain use cases. Check the API out later. This is an example page of the MusicBrainz website, showing the detailed information for the composer John Williams. Here, we see a list of all John's recordings and some extra metadata about him as a person. Our plan for today is as follows. We replicate all the MusicBrainz data into a local database management system and then run SQL queries to create various single-table data files for later experiments and analysis, for example using Python and pandas. Sounds easy, right? But let me remind you that typically 80% of the time in machine learning projects is consumed by data management tasks. Maybe a little anecdote from when I was preparing these slides. I assumed, naively so, that downloading and importing the MusicBrainz data into my own DBMS and getting some simple queries done would maybe take me about, well, 20 minutes. However, three hours later it was clear that this was overly optimistic. So what happened here? Ah, it was actually just the usual stuff. Part of the import scripts I used were outdated. There were version conflicts between different software libraries I had installed on my system. The documentation of the scripts was incomplete and outdated.
And finally, after I managed to replicate all the data, it turned out that the schema used by the database was very complex, and it took me a long while to actually understand it. Also, some of the data I hoped to get was not available, as it was restricted due to privacy laws, for example ratings given by individual users. I would argue that many of these issues are, well, actually somewhat normal, and it is not uncommon that dealing with data import tasks takes much, much longer than you might think. But let's leave these troubles behind and have a look at what we get. We get over 2 million artist records with over 27 million recordings and over 3 million releases. Releases are things like albums or singles. This is also a lot of data in sheer volume. The compressed data dumps I downloaded were 5 GB, and after extracting and importing all these dumps, they take up 33 GB on my hard drive. This is actually quite a lot of data, and it's certainly way more data than you would typically want to handle manually in a comma-separated value text file. And I certainly would not recommend anybody even try to query such data directly from a file without using the awesome powers of a database management system. Here you can see an example logical schema for a MusicBrainz-style relational database I was using in one of the previous videos. It looks very easy to understand. Artists are connected to recordings. Recordings are connected to albums. Well, and that's essentially it. This, of course, was just a toy example. The real schema used by the MusicBrainz database is much more complex than that. It actually uses a staggering number of 296 database tables, and if you want to import some extra metadata, you can easily add another 50 tables. The online documentation of the schema uses an extremely simplified visualization, skipping over a lot of detail.
But this diagram is good enough, if you stare at it long enough, to get a first understanding of how the database is structured in general. I also tried to create an automatic visualization of all these 296 tables used in the schema, which resulted in this super interesting piece of art which you can see here in the middle. Or rather can't, because it's just too tiny. I even gave up trying to understand this; way too many tables. Why am I even talking about this? Well, my takeaway point is that real data used by real systems is often very big and very complex and by far exceeds what you typically see in small toy examples used in an educational setting. Also, many of the data sets you find in repositories like Kaggle have actually been carefully preprocessed, curated and simplified before being shared. They are not representative of what you find in real relational databases in real systems. On this slide, however, you see the visualization of the real data structures. Now, before I fall back to using my toy examples myself, let's let this sink in and be prepared for when the time comes when you're faced with real data of a similar level of complexity. Good. It's toy example time again. I will make a simplification of the MusicBrainz schema which looks like this diagram. I'm only focusing on the entity types I'm interested in, which in this case would be artists, recordings, releases and release groups. Release groups are things like albums and singles, but from a conceptual viewpoint. A release group could for example be the album Star Wars Trilogy Soundtrack, and release groups can then have multiple releases, which typically happen in different countries or on different mediums. For example, there might be a German CD box release of the Trilogy Soundtrack album, or another release in old-school fashion using vinyl. The fascinating part about this simplification is that I could do it completely within my local database management system using a concept called views.
I will not go into details here, but views are virtual tables which can be defined using SQL queries and which then dynamically load data from other existing tables. So essentially, I created a simplified logical schema, but when I'm using it, the DBMS will translate and map all queries I run to the significantly more complicated schema with the 296 tables you have seen on the previous slides, and I won't notice this at all. This allows me to run simple toy examples on the next slides and still use the full extent of the 2 million artists and 27 million recordings of the real database. Okay, let's now introduce SQL, the standard structured query language which is used by all relational database management systems. As a side note, many people pronounce SQL as "sequel", but they mean the same thing. I do not expect that you can understand SQL just by listening to me today. You need to read up on additional concepts and also practice, practice, practice in order to get it. However, I can provide you with an overview of the general idea. First off, SQL is a declarative query language whose vocabulary is inspired by informal English. Declarative here means that you describe what you're looking for with your query, but not how to execute that query efficiently. This is one of the most awesome features of a relational database management system. The question of how to execute and optimize a query such that it is as fast as possible is automatically handled by the system, and you don't need to do anything here. The basic syntax of SQL looks like this: SELECT attributes FROM tables WHERE condition holds. Of course, it can do so much more. It can also join multiple tables, compute statistics, create aggregates, and run subqueries or set operations. But the very basics are just this: select, from, where. Let's try our first example query. Find all the recordings of John Williams and return their ratings. How would that work?
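To make the select-from-where pattern and the idea of a view concrete, here is a minimal sketch in Python using the built-in sqlite3 module instead of a full PostgreSQL server. The table, column names and data are invented for illustration and do not match the real MusicBrainz schema:

```python
import sqlite3

# In-memory toy database; schema and data are made up for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO artist VALUES (?, ?)",
                [(94, "John Williams"), (95, "Hans Zimmer")])

# The basic SELECT ... FROM ... WHERE pattern: star returns all attributes.
rows = con.execute(
    "SELECT * FROM artist WHERE name = 'John Williams'"
).fetchall()
print(rows)  # [(94, 'John Williams')]

# A view is a virtual table defined by a query; selecting from it
# transparently re-runs the underlying query against the base tables.
con.execute("CREATE VIEW composers AS SELECT id, name FROM artist")
print(con.execute("SELECT name FROM composers").fetchall())
```

The same SELECT-FROM-WHERE syntax works unchanged against PostgreSQL; only the connection setup differs.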
Well, select John Williams from the artist table, but at the same time also select all the recordings which are connected to John Williams. This connection is expressed by this intermediate table here in the middle between artist and recording, which I will call link_artist_recording for now. This table contains artist IDs and recording IDs. An artist is linked to a recording if this table contains the ID of that artist and the ID of the recording in the same row. But let's start simple. Let's first find John Williams. The SQL query we create for this would be: select star from artist where name equals John Williams. The star here represents that I want to return all attributes of the artist table instead of selecting a specific attribute. When executing that query on my data replica, it turns out that there are quite a few artists called John Williams. Too bad. Things are never really that easy. By looking up some of these John Williamses on the MusicBrainz website, I found out that the John Williams I was looking for is the one here with the ID 94. Just to make that clear, I'm looking for the John Williams who is the famous composer of the soundtracks of movies like Star Wars and Indiana Jones, and not the John Williams who is an unknown sound technician making the CD recording for a high school band. But now that I found him, how do I find all the recordings connected to John Williams? Well, I will now just spring this SQL query statement on you; later, when recapping this lecture, try to analyze it more carefully. What are the main parts here? We select from the artist table where the artist ID is 94. Additionally, we join or link all the found records such that the artist ID matches the artist ID found in the link_artist_recording intermediate table we have seen before. Combining tables such that the IDs match in this way is what we typically call a join. So, we join these two tables.
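The join through the intermediate table can be sketched as follows, again with sqlite3 and invented toy data (the link table is called l_artist_recording here; all IDs, names, lengths and ratings are made up, not real MusicBrainz records):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE recording (id INTEGER PRIMARY KEY, name TEXT,
                        length INTEGER, rating REAL);
CREATE TABLE l_artist_recording (artist_id INTEGER, recording_id INTEGER);
INSERT INTO artist VALUES (94, 'John Williams'), (95, 'Someone Else');
INSERT INTO recording VALUES (1, 'Imperial March', 183, 4.5),
                             (2, 'Raiders March', 312, 4.0),
                             (3, 'Unrelated Track', 200, 2.0);
INSERT INTO l_artist_recording VALUES (94, 1), (94, 2), (95, 3);
""")

# Join artist -> link table -> recording: rows are combined wherever
# the IDs match, so only John Williams' recordings come back.
rows = con.execute("""
    SELECT recording.id, recording.name, recording.length, recording.rating
    FROM artist
    JOIN l_artist_recording ON artist.id = l_artist_recording.artist_id
    JOIN recording ON recording.id = l_artist_recording.recording_id
    WHERE artist.id = 94
    ORDER BY recording.id
""").fetchall()
print(rows)  # [(1, 'Imperial March', 183, 4.5), (2, 'Raiders March', 312, 4.0)]
```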
Then we do exactly the same thing for the recording table. Finally, we decide to only select the recording ID, name, length and rating, but not any other attributes. We can see the result in the table on the bottom right, and this can easily be exported to a comma-separated value file, an Excel file, or maybe even directly loaded into a pandas data frame. This is pretty cool and easy. Let's try another query, this time trying to find all of John Williams' albums. Album here means release groups where the type is album. It's the same idea as before. I start by selecting from the artist table the artists which have ID 94, which, well, should only be one. Then I join the artist table with the l_artist_release link table, join that with the release table, and join that finally with the release_group table. And at the very last, we select only those rows where the release group type is album. I also got a little bit lazy here in my example and used the feature of SQL for defining shortcut names for tables. So instead of always typing l_artist_release, I temporarily renamed this table, for this query only, as LAR; the release table I simply call R, and the artist table just A. Again, I would recommend you to look up several other examples of SQL queries, read some extra documentation, and definitely try writing some queries yourself. Let's make another observation about some of the awesome features of a relational database management system. I told you before that the database I replicated has over 27 million recordings and 2 million artists. The queries I just showed you were selecting only one specific artist out of these 2 million and then trying to find all the recordings of this artist, which should be around 600 to 700 out of the 27 million stored in the system, in a data set which actually takes 33 GB on my hard drive. And all of this just happened in 69 milliseconds on my desktop computer. I assume that sometime in the future you will be experimenting with Python and pandas.
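The alias trick can be sketched like this, once more with sqlite3 and an invented miniature schema. Note that release happens to be a reserved word in SQLite, hence the quotes around the table name; the real MusicBrainz tables and columns differ from this toy version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy schema for illustration; "release" must be quoted in SQLite.
con.executescript('''
CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE release_group (id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE "release" (id INTEGER PRIMARY KEY,
                        release_group_id INTEGER, country TEXT);
CREATE TABLE l_artist_release (artist_id INTEGER, release_id INTEGER);
INSERT INTO artist VALUES (94, 'John Williams');
INSERT INTO release_group VALUES (10, 'Star Wars Trilogy Soundtrack', 'Album'),
                                 (11, 'Some Single', 'Single');
INSERT INTO "release" VALUES (100, 10, 'DE'), (101, 10, 'US'), (102, 11, 'US');
INSERT INTO l_artist_release VALUES (94, 100), (94, 101), (94, 102);
''')

# AS defines temporary shortcut names, so we can write A, LAR, R and RG
# instead of the full table names throughout the query.
albums = con.execute('''
    SELECT DISTINCT rg.name
    FROM artist AS a
    JOIN l_artist_release AS lar ON a.id = lar.artist_id
    JOIN "release" AS r          ON r.id = lar.release_id
    JOIN release_group AS rg     ON rg.id = r.release_group_id
    WHERE a.id = 94 AND rg.type = 'Album'
''').fetchall()
print(albums)  # [('Star Wars Trilogy Soundtrack',)]
```

DISTINCT is used here because one release group (the album) has several releases, and without it the album name would appear once per release.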
Then try to find a specific record in a slightly larger comma-separated value file and use a stopwatch to check how long this takes. I would bet good money that it would actually take much longer than 69 milliseconds. This is one of the reasons why I would want to have a proper relational database management system when doing complex queries, because over 50 years of relational database query processing research and development have taught us how to execute these queries really, really, really fast using all kinds of internal optimizations and tricks. However, the whole point of this exercise was to extract data sets which contain exactly the type of data you need for your later machine learning application development. So all you hopefully need to do now is export the results and then load them without ever querying them again. Or maybe, to phrase that much more strongly: if you ever end up in a situation where you have multiple extremely large comma-separated value files and you try to join them in Python, then something went horribly wrong in your data preparation pipeline. Good, but let's get back to exports like this file here. It contains all the recordings of selected artists, well, just one in my toy example, together with their ratings. Together with some extra data, files like this could for example be fed into the training of a music recommender system. So, what did we learn today? SQL, the structured query language, can efficiently and flexibly query relational data. It can be executed very fast, and the result of an SQL query is always a single table. This table can then be easily exported, for example into a comma-separated value file or an Excel file, supporting downstream AI and machine learning applications. Thus, what we have seen today was a toy abstraction of an ETL (extract, transform, load) pipeline, extracting data from an imaginary enterprise database system for later analysis. What should you be doing now?
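Exporting a query result to a CSV file needs only a few lines. This sketch uses Python's standard csv module on a toy table; the file name and schema are invented, and pandas.read_sql would be an alternative for loading the result straight into a data frame instead of a file:

```python
import csv
import sqlite3

con = sqlite3.connect(":memory:")
# Toy table standing in for the query result from the lecture.
con.executescript("""
CREATE TABLE recording (id INTEGER PRIMARY KEY, name TEXT, rating REAL);
INSERT INTO recording VALUES (1, 'Imperial March', 4.5),
                             (2, 'Raiders March', 4.0);
""")

cur = con.execute("SELECT id, name, rating FROM recording ORDER BY id")
with open("recordings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # cur.description lists the result columns, giving us a header row.
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)  # one CSV row per result row
```

The resulting single-table file can then be loaded in any downstream tool without touching the database again.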
Well, definitely read up on some more advanced SQL concepts and practice a bit. And with this, I thank you for your attention today, and hopefully see you another time.