This video is about using a database for CSV management. If you've spent some time on data science projects, you've probably developed different management strategies for CSVs. One option is to keep a CSV in your project's root directory. Another is to keep all your CSVs in one single repository, so all of your data lives independently of whatever repository you're working in. Recently I've been trying an approach where I push all my CSVs, even ad hoc CSVs, to a database as a way to organize them. The database almost acts as a tree, and the CSVs are just ornaments I put on that tree.

The database I've been really liking is DuckDB. It's fast and it has a lot of functionality. It's not something I'm necessarily trying to put in production, but it has a lot of similarities with SQLite, a database I've used quite a bit in the past. SQLite is known for being a database that works on your computer, which is different from how MySQL works, where you need a client and a server. They call that an embedded database, and both DuckDB and SQLite are embedded databases.

One feature I like in DuckDB is how easily it reads CSVs and automatically types them, so I don't need to worry about defining a schema. The command I'll show in a little bit is read_csv_auto, which auto-infers the schema and populates the database from CSV files (there's a quick sketch of that call at the end of this section).

What I'm doing in the background here is downloading DBeaver. It's entirely possible to use DuckDB from the command line, but I think there are some benefits to having a database IDE. I'm not going to show off any of those benefits in this video, though.

Now, it took a couple of tries; I couldn't quite remember how I set up a DuckDB database. The way I usually organize things is that my different projects go inside a code repository, but data has always lived in a separate data directory where I put my CSVs. In this instance, I'm going to put a database in my data directory instead. The database is just a file with some name and the file extension .duckdb. First I thought I could simply make that file with touch, but that doesn't work. When I test a connection, DBeaver automatically prompts to download the DuckDB driver, and then, because I just touched a file instead of creating one properly, it said this csv.duckdb file isn't going to work. All I had to do was remove that file and let the database IDE create it for me. I test the connection again and everything works.

Another thing that bothers me about DBeaver right out of the box is that it ships with a light theme, which doesn't mix well with my Everforest GTK theme. I searched around a little, went into the properties, and eventually found the spot I needed to change the appearance. I set the appearance to dark, applied it, and restarted DBeaver. With that, I've got a nice theme, and I can see the csv.duckdb connection where I'm going to put some CSV files. Right now, if you go to the main schema, you'll see there are no tables and no views; that's what's going to get populated.

So what I'm going to do now is look for some data to download. A pretty classic dataset is the diamonds dataset, and I found a repository where someone stored the diamonds CSV file as part of their project.
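As a quick preview, here's a minimal sketch of that read_csv_auto call. The file path is a placeholder for wherever the CSV ends up, not the actual path from the video:

```sql
-- Minimal sketch: the path is a placeholder, not from the video.
-- read_csv_auto samples the file, infers column names and types,
-- and returns the contents as a table with no schema declared.
SELECT * FROM read_csv_auto('/home/me/data/diamonds.csv');
```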
So here I do a wget and bring down that file. The goal now is to create a table from a CSV file. Normally this would require a schema setup, where for each column in the CSV you have to say whether it's an integer, a VARCHAR, a date, and so on. If you're using SQLite, there are only a couple of data types that can be specified; DuckDB actually has a huge variety of data types, and from what I've seen so far, it does a pretty good job of inferring them in the read_csv_auto function.

You do have to specify the full path when you do a read. I was trying to do this from memory at first, and it just wasn't working out, so I had to copy one of the examples from their documentation pages. Again, this is a newer workflow for me; I've really enjoyed this way of organizing files so far, but I'm definitely not an expert in it. I got an error when I first tried this, and I'm not sure what I was doing wrong. I just copied their example and replaced its parameters with mine; maybe there was a misspelling. But after I ran the CREATE TABLE ... AS SELECT * FROM read_csv_auto statement (sketched below), the CSV was imported into the database successfully. DuckDB has a DESCRIBE command, so I could see the columns and the types, and again, from what I can tell, it does a pretty good job at auto typing.

I think it's possible there are some data scientists out there who have only used libraries to wrangle data and haven't used that much SQL, and that's not necessarily bad. But there's a lot of really useful grammar that has already been invented for manipulating data in tables, and I personally really like using either libraries or domain-specific languages that take advantage of the vocabulary you'd find in SQL. So naturally, I might be the target audience for something like DuckDB.

Some of these operations, where I have these ad hoc CSVs in DuckDB, are just quick analysis: checking what the column names were again, how many rows I have, what happens when I average or do grouped averages. Fairly simple things. But I don't want to discount DuckDB, because it can do quite a bit more, and I'll show some of those functions.

Right now I'm just downloading some more CSVs to load into DuckDB; it's nice to see something done a couple of times just for the repetition. I think this is the Sacramento dataset; it has something to do with real estate, and I almost think I've seen it before, maybe in the Introduction to Statistical Learning book. Anyway, the content of the CSV isn't really what I'm interested in. I'm just trying to show the process of loading CSVs into a database and, hopefully, that it's a very simple process. In fact, this CREATE TABLE line could be saved as a script, which is what I have on my work computers: a script that's specifically there to read CSVs into the database (I'll sketch what that might look like below).

After refreshing the database, you can actually see the tables in it. You can double-click on them and see the data inside, and if you want a quick preview, you can look at the columns and the data types associated with those columns. DBeaver is a really good way to interact with data, and I think they do a really good job of making it easy to thumb through what's in your tables.
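Here's roughly what that statement and the DESCRIBE check look like. The table name and path are mine, patterned on the documentation example rather than copied from the video:

```sql
-- Create a table straight from a CSV; the schema is inferred.
-- (Table name and path are assumptions for illustration.)
CREATE TABLE diamonds AS
    SELECT * FROM read_csv_auto('/home/me/data/diamonds.csv');

-- Show the inferred column names and types.
DESCRIBE diamonds;
```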
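And here's a sketch of the kind of loader script I mentioned. This is a guess at its shape, not the actual script from my work machine; it could live in a .sql file and run against the database file through the DuckDB CLI:

```sql
-- load_csvs.sql (hypothetical): run with something like
--   duckdb ~/data/csv.duckdb < load_csvs.sql
CREATE TABLE sacramento AS
    SELECT * FROM read_csv_auto('/home/me/data/sacramento.csv');
```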
Alright, now that we've got a few CSVs imported, I wanted to do a little bit of a SQL tutorial. There are way better tutorials out there, but I also just want to show some of the functions available in DuckDB, starting with the math functions. There are log functions and more mathematical ones like factorials and a gamma function; there are quite a few options to take advantage of. A lot of databases do have these.

Here are some text functions: getting slices of strings, concatenating strings, getting the MD5 hash of a string, padding. Padding is something I use quite a bit, and as part of the text functions I saw there's an LPAD. Say you're working in retail and you've got one table where the items you sell are identified by integers, and another table, maybe a promo table, where for whatever reason those items are stored as strings. You might want to take the integer field and pad it with whatever characters are needed to make that join possible (there's a sketch of this after this section). So having that pad function is nice; again, it's pretty common in other databases.

DuckDB also has strftime and strptime, which you might have seen in other libraries, with the very standard syntax for formatting and parsing times. There are also interval functions, so you can add hours, add days, and calculate date diffs (sketched below as well).

There are window functions too. A window function is something that's applied to a window of rows. I really like that they have a lag function and a lead function. When it comes to time series, it's nice to be able to lag by seven days, or two or three weeks if you're dealing with weekly seasonality, right in the database, just to see your week-over-week trends (I'll sketch that below too).

There are lots of interesting aggregate functions. I'm not sure what the histogram function does, but I've never seen that in a database engine before, so that's kind of cool. Maybe. I don't know, maybe it's not cool.

Sampling is kind of interesting. When I was looking at this, my wheels were turning a little: I wondered if sampling could be used to create train/test splits or folds. I haven't played with it much, and I'm not sure whether you're allowed to sample both with and without replacement, so if you wanted to do something like a bootstrap, I'm not sure that would be possible (a basic sampling sketch is below).

I think that's enough looking at the API. Now I just want to go over some really basic SQL, in case you're looking for a little bit of a SQL refresher or tutorial. SELECT * FROM diamonds just means select everything: every row that exists in diamonds. Next I do a SELECT * FROM diamonds WHERE cut equals Ideal; I've got lots of typos in there to fix, but it's just filtering down to the cuts that are Ideal. Then there's an example of applying a function to a column: grabbing the price column and taking its log. Side by side, I can take price and log of price and put them next to each other. I'm not sure why I would do that, but it's an example of how to select columns and how to apply functions to them. The mean is a function that summarizes the data; if you don't have any grouping conditions, it just takes a straight average of that column. Here it looks like the mean price is about $4,000.
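Before moving on, here are quick sketches of the functions I called out above. First, the LPAD join trick; the table and column names are made up for illustration, with promos.item_code imagined as a zero-padded string like '00042':

```sql
-- Hypothetical retail example: items.id is an INTEGER, promos.item_code
-- is a zero-padded VARCHAR. LPAD makes the join possible.
SELECT i.id, p.discount
FROM items AS i
JOIN promos AS p
  ON LPAD(CAST(i.id AS VARCHAR), 5, '0') = p.item_code;
```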
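Next, the parsing, formatting, and interval pieces; the dates here are made up:

```sql
-- strptime parses text into a timestamp, strftime formats one back
-- out, and intervals handle the date arithmetic.
SELECT strptime('2021-03-01', '%Y-%m-%d')                        AS parsed,
       strftime(TIMESTAMP '2021-03-01 00:00:00', '%b %d, %Y')    AS formatted,
       DATE '2021-03-01' + INTERVAL 7 DAY                        AS one_week_later,
       date_diff('day', DATE '2021-03-01', DATE '2021-03-15')    AS diff_in_days;
```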
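Then the lag idea for week-over-week trends, against a hypothetical daily_sales table:

```sql
-- Assumed table: daily_sales(day DATE, sales DOUBLE), one row per day.
-- Lagging 7 rows back compares each day to the same weekday a week ago.
SELECT day,
       sales,
       sales - LAG(sales, 7) OVER (ORDER BY day) AS week_over_week
FROM daily_sales;
```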
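And sampling; this is just the basic syntax as I understand it from the docs, and I still haven't checked the with/without replacement question:

```sql
-- Draw roughly 10% of rows, then a fixed number of rows.
SELECT * FROM diamonds USING SAMPLE 10 PERCENT (bernoulli);
SELECT * FROM diamonds USING SAMPLE 1000;
```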
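Finally, here are the basic tutorial queries again, written out cleanly:

```sql
-- The walkthrough queries, with the typos fixed:
SELECT * FROM diamonds;                      -- every row, every column
SELECT * FROM diamonds WHERE cut = 'Ideal';  -- filter to one cut
SELECT price, LOG(price) FROM diamonds;      -- a column and a function of it
SELECT AVG(price) FROM diamonds;             -- straight mean, about $4,000 here
```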
Now, renaming a column: you can take the mean price and just say AS. AS is the keyword that gives a name to the column, so the mean of price becomes avg_price, and the median of price I'm calling med_price. If I don't want just the overall average, I can group by cut to get the averages by cut, and add the cut column to the select so each row is labeled with its cut. I can take that same statement and then order by a column; here I order by the median price. If I want it descending, with the highest price first, I just add the DESC keyword. To clean up the price in this statement, I also round the average price to two decimals, since prices normally have just two (the full statement is sketched below).

Alright, let me try some different things, again starting from SELECT * FROM diamonds. Now I'm curious how many distinct carat values there are. I can see carat is numerical; I just didn't know how many distinct carats are in this dataset, so I wanted to get a sense of that. From the previous statement, it looked like cut wasn't really that correlated with price, so now I'm curious whether carat is a little more correlated with price. I'll take the same statement I was using before, but instead of grouping by cut, I'll group by carat and see what it looks like. I also think it's useful to include the number of observations as n, to show how many samples are being averaged in each row; that can give a sense of how much weight you might want to give it. I messed this statement up a little: I don't need COUNT(DISTINCT ...) in there, just COUNT, so COUNT(carat) AS n. Now I get a better sense of how many diamonds had each carat value. Of course, the more observations there are behind a row, the more you might be inclined to believe its value wouldn't change much. For the 3.51 carat there's only one observation, at a pretty high price; maybe that wouldn't happen all the time, it's just a one-off.

I remember reading in R for Data Science that Hadley Wickham will sometimes do counts of counts; that way you almost get a Poisson-looking distribution of how many different values there are of that thing (a sketch of that is below too). He had some reasons for why he likes doing that in his analysis. They're not really coming to mind right now, but I know I do it from time to time; I just can't think of why as I'm looking at this.

So that's a little tutorial on DuckDB and how it can be used to organize CSV files. Thanks for watching.
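Here's the grouped summary that got built up in this section, cleaned up; the alias names are mine:

```sql
-- Average and median price by cut, with a row count, rounded and
-- ordered with the highest median first. Swap cut for carat to get
-- the by-carat version discussed above.
SELECT cut,
       ROUND(AVG(price), 2) AS avg_price,
       MEDIAN(price)        AS med_price,
       COUNT(carat)         AS n
FROM diamonds
GROUP BY cut
ORDER BY med_price DESC;

-- How many distinct carat values are there?
SELECT COUNT(DISTINCT carat) FROM diamonds;
```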
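And a sketch of the counts-of-counts idea, assuming the diamonds table from earlier. The inner query produces one row per carat value; the outer query shows how many carat values share each group size:

```sql
-- Count rows per carat, then count how many carat values have
-- each row count n.
SELECT n, COUNT(*) AS n_carats
FROM (
    SELECT carat, COUNT(*) AS n
    FROM diamonds
    GROUP BY carat
) AS per_carat
GROUP BY n
ORDER BY n;
```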