Hey everyone, welcome to the useR! 2020 tutorial on disk.frame. My name is Juliana, and along with Elkie, who is also here today, I organize the R-Ladies San Diego chapter. Today we're really happy to host ZJ Dai's tutorial. ZJ has 13 years of experience in analytics and data science, has a math background, and is currently working on a general-purpose data science AutoML platform. I'm really excited to hear more about this awesome package ZJ created called disk.frame, and I hope you are too. So with that, I'll hand it over to ZJ.

Hi, this is ZJ. This talk is titled "You Don't Need Spark for This". It's about disk.frame, a larger-than-RAM data manipulation framework I have been developing in R.

We spend about 80 percent of our time dealing with data, so it makes sense to have really good tools for dealing with data in R. When you look at the data ecosystem, not just in R but in general, you see data of various sizes, and here is how I think about the different data size segments. You have trivial data, which is basically data that fits in an Excel spreadsheet, roughly anything with less than a million rows. Then you have small data: any data that fits into RAM. These datasets can be dealt with easily in R as a data frame, or in Python with a tool like pandas. At the other end you have big data, which really does require a cluster of computers to manipulate effectively, because it doesn't fit in RAM, it doesn't fit on your hard drive, and it requires a lot of processing time, so it's better to distribute it. But in 99 percent of cases you don't really have big data that requires a cluster to manipulate effectively. A lot of the time your data simply sits in the segment I call medium data, and it's the medium data segment that I want to focus on: data that fits on your hard drive but doesn't fit into RAM.

For a long time R didn't have a good tool for this type of data, and this is where disk.frame comes in. disk.frame is an R package, as I mentioned, that works with data that's too large for RAM but still fits on your hard drive, so you can use one computer to manipulate it. Traditionally people would use things like SAS, or in Python Dask, or a newer tool called Vaex, or JuliaDB in Julia. Some people use Spark on a single computer; that kind of defeats the purpose of Spark, I guess, but it kind of works.

So this is what I think of as the ideal medium data tool: it's got to be free, it's got to be easy to set up, and it's got to be fast. A lot of people use databases for medium data, but depending on whether you're lucky or not, you might be in a corporate environment where a database is just not something you can set up for yourself. If someone at your workplace has set up a database for you, that's really nice, you can query data from it, but that may not be accessible to everyone. So disk.frame tries to be easy to set up as well. Roughly speaking, this is where I see things sitting: you've got Spark over here, which is free but a little bit slow (actually not too bad), and SAS, which is expensive and slow at the same time.
And then you have disk.frame and Dask over in this area: free and reasonably fast.

So how does disk.frame handle such large datasets? It's actually quite simple. You take a big dataset and you break it up into smaller files, and you store each of the smaller files in a folder. This folder with many small files, plus maybe a metadata folder, is what I call the disk.frame format. It's as simple as that. A few things become possible once you break a large dataset into smaller chunks. Firstly, you can process the chunks in parallel using the many cores on your computer, which speeds things up. When you store your data on disk, if you compress it, you can load it into memory faster, which also speeds up a lot of things. You can also apply online algorithms so you can operate on the chunks more efficiently; if I have time, I'll demonstrate a use case where we fit a logistic regression with disk.frame. So those are the key ideas of disk.frame.

disk.frame wouldn't be possible without the giants it stands on: data.table; future and future.apply, which we use to parallelize all the tasks; and the fst file format. If you haven't come across fst, it's really great. It stores tables in a really efficient format that's fast to load, and it has lots of amazing features: you can read a particular column from a table, or particular rows, without having to load the whole dataset into memory. disk.frame makes heavy use of fst to speed things up. And of course the tidyverse is so popular, and you'll see that disk.frame makes use of a lot of the tidyverse, especially the dplyr verbs.

The reason I developed disk.frame was that when I was working at a big bank, we had this big dataset, maybe 350 million rows and about 200 columns, and all I wanted to do was select one column from a table and compute the sum of something. It was taking 25 minutes in SAS. After a while I got sick of it, so I developed disk.frame to break the dataset into smaller pieces and access them in parallel, and I was able to do the same task in three seconds. The problem with SAS was that it wasn't using all the CPU cores, and it wasn't taking advantage of the fact that most PCs now have SSDs, so it wasn't exploiting fast random access or the fact that you have multiple cores and can parallelize a lot of your processing. As we go through this tutorial, you'll see that disk.frame takes advantage of both: the parallel processing should be obvious, and fst is really good at reading data in a random-access fashion, so it's good for reading data from an SSD.

Okay, so that's a little bit about disk.frame. Next, I'll go through the disk.frame website, because it answers a lot of very common questions people ask. It's just diskframe.com. Some of the common questions: why was it created? I don't know how many of you have run into this, but it always shows up as something like "Error: cannot allocate vector of size ...", which means your computer has run out of memory. disk.frame gets around that by letting you handle large datasets that sit on your hard drive. How is it different from data.frame and data.table?
data.frame and data.table sit entirely in memory. A disk.frame's data sits on disk, so you process it by loading one chunk at a time into memory. And how does disk.frame work? I'll go through how it makes use of dplyr, and I'll show you functions like cmap and how they work, et cetera. If you go to the disk.frame website and you want to learn more, you can go to the reference section to look at all the functions defined by disk.frame. I've also written a few articles and posts that discuss various aspects of disk.frame, so feel free to browse through them and learn more that way.

Okay, so let's get into the tutorial proper. As I mentioned, disk.frame is all about manipulating data on disk. If you have access to the code, which has been posted in the chat, you can clone the repository, which is the disk.frame useR! 2020 tutorial. I'll go through the code in order, notebooks one through four, and if you want to run the code while I'm talking, feel free to do that. The way I've set it up, on the left is my RStudio where I run the code, and on the right is the result of running it, plus sometimes some pictures, so we can talk about the results as we need to.

disk.frame is on CRAN, so if you want to install it, simply run install.packages. I'll also be using the nycflights13 package, which contains the datasets I want to use for this tutorial.

Now let me show you what actually happens when you load disk.frame. When I run library(disk.frame), it prints out a bunch of messages, some of them in colour. It tells you a few things: it says we're using one worker with disk.frame, and that if you want multiple workers, run setup_disk.frame(). As I mentioned, disk.frame tries to parallelize a lot of the processing, so after you call library() it's highly recommended that you run the setup_disk.frame() function, because it sets up multiple R sessions in the background for you; then every time you run a disk.frame operation, it will try to parallelize it. For example, if I run it now, it looks at how many CPU cores I have on my computer and makes available as many workers as there are cores, so now I have six workers. If I go to the task manager and look for R, you can see I have a few R sessions open: three here and three below, so six in total. That's what setup_disk.frame() does: it sets up a few R sessions, some of them in the background, so that when you tell disk.frame to run something, it makes use of all those cores. That's why, when you load disk.frame, it always tells you to run setup_disk.frame first; a minimal sketch of this boilerplate follows.
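Here is a minimal sketch of that setup boilerplate, assuming disk.frame and nycflights13 are installed; the explicit workers argument is optional and shown only for illustration.

```r
# install.packages(c("disk.frame", "nycflights13"))  # one-off install from CRAN
library(disk.frame)

# Start background R sessions (workers) so disk.frame operations run in parallel.
# With no arguments it uses as many workers as there are CPU cores;
# workers = 6 here is purely illustrative.
setup_disk.frame(workers = 6)

# Check how many workers are available (disk.frame uses {future} underneath).
future::nbrOfWorkers()
```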
There's one more setting it mentions, and it's optional: options(future.globals.maxSize = Inf). The R workers need to send data to each other, and what this option does is allow an unlimited amount of data to be passed from worker to worker. That can be good if you're dealing with really large datasets, but you may want to leave it alone initially, just to test out your algorithm, because passing lots of data between sessions can take a long time.

Okay, so that's all the boilerplate I want you to do. It's really, really simple: just run setup_disk.frame() and that's it. One more thing I'll show right now: when you call setup_disk.frame, if you feel like you should have something to look at, you can set gui = TRUE, and if you have shiny installed, it opens a settings page where you can play around with the options. Currently there are only two settings in disk.frame, but as more settings get added you can just open the GUI and set them there instead. For example, I can set the workers to four or six and then close it, and that's the same thing. So you can use gui = TRUE to show the graphical settings. The point is that you really do want multiple workers if you want to take advantage of your CPU cores, so I highly recommend running setup_disk.frame every time you use disk.frame.

Okay, so let's go to the basics now: how do you actually use disk.frame? If you look on the web, there's a Spark tutorial for manipulating this flights dataset and a dplyr tutorial for manipulating the same dataset, and this notebook is basically me replicating those tutorials so we can show how similar the functionality is. As I said before, you load disk.frame once you've installed it, and you call setup_disk.frame, which sets up multiple workers for use with disk.frame.

So how does it actually work? disk.frame requires you to store the data in the disk.frame format, which, as I mentioned, is nothing but a bunch of fst files. What I've done here is load the other libraries I need and create a temporary path — a path in the temporary directory, so it gets deleted once I close R. And this is the first function I'll show you: inside the nycflights13 package there's a dataset called flights, and all I'm doing is converting this data frame into a disk.frame with as.disk.frame. Because disk.frame is an on-disk format, you have to tell it where to save the output, and I'm setting overwrite to TRUE because I've run this before and want to overwrite the previous result. A minimal sketch of this conversion follows. (When I ran this live, the chunk output stopped showing inline in RStudio for this particular cell — I'd never had that happen before, and I'm pretty sure I tested it — so if anyone knows what the issue might be, let me know in the chat.)
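A minimal sketch of that conversion; the output path used here (flights_df_path) is purely illustrative.

```r
library(disk.frame)
library(nycflights13)

# A temporary location for the disk.frame; it disappears when the R session ends.
flights_df_path <- file.path(tempdir(), "flights_df")

# Convert the in-memory flights data frame into a disk.frame stored on disk.
flights.df <- as.disk.frame(
  flights,
  outdir    = flights_df_path,
  overwrite = TRUE
)

flights.df  # printing shows where it is stored and how many chunks/rows/columns it has
```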
Anyway, if that had printed properly, this is what you would see when you convert something to a disk.frame: typing its name tells you where the disk.frame is stored and how many chunks there are. (A few attendees reported the same thing — it wasn't printing for them either — and someone suggested checking the configuration icon beside the Knit button and setting "Chunk Output Inline"; it seemed to affect only this one cell, so I'll just skip it and point at the expected output on the right-hand side.) It also tells you how many rows and how many columns there are, and I've implemented a few convenience functions for this: given a disk.frame, nrow and ncol work just like in base R and tell you how many rows and columns there are. It tells you where everything is stored, too. If you list the output directory with dir(), you can see there are six chunks, and it prints out all the filenames — dir() of course just lists the files in the directory.

As I mentioned, if you don't know about fst, check it out; it's really great, one of my favourite packages, and it makes loading datasets so, so quick. To show you how it works: each chunk is a data frame saved as an fst file, so all I'm doing now is using the fst package to read one of those fst files directly, and it just returns a data frame — if I print its class, you can see it's a data.frame. That's basically all there is to a disk.frame: it's just a bunch of data frames saved as fst files for fast reading and writing, and you can directly access those fst files as if they were stored data frames. The numbering system is just one to however many chunks you have. disk.frame simply provides functions on top of that to manipulate these chunks efficiently; a short sketch of this direct access follows.
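A minimal sketch of reading one chunk directly with the fst package; the chunk file name assumes the flights_df_path directory created above.

```r
library(fst)

# Each chunk of a disk.frame is just an fst file named 1.fst, 2.fst, ...
dir(flights_df_path)

# Read the first chunk directly; read_fst() returns an ordinary data.frame.
chunk1 <- read_fst(file.path(flights_df_path, "1.fst"))
class(chunk1)   # "data.frame"

# fst can also read just a few columns or rows without loading the whole file.
read_fst(file.path(flights_df_path, "1.fst"),
         columns = c("year", "month", "day"),
         from = 1, to = 5)
```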
I'll show you how group-bys work a little bit later in the tutorial; we'll definitely get to that. But remember that each fst file is exactly one chunk, and we try to work on multiple chunks in parallel — that's one of the secrets of the whole thing. If you think about it, for a lot of the dplyr operations like filter and mutate there's no reason to work on one chunk at a time: if you filter each chunk by some criterion, you might as well filter them all in parallel and combine the results; it's no different from doing them one by one. So filter and mutate can obviously be run in parallel, and certain group-bys and summarizes can be done in parallel as well.

So let's look at one example: this is how you would go about filtering the flights for a particular year. If you recall, I've converted the flights data frame into a flights disk.frame, and I'm just applying dplyr code to filter it. And you see this collect here? collect is very important, and it works exactly the same way as it does if you've used Spark: if you want to actually get the result back as a data frame, you need to call collect. I'll go through what happens if you don't call collect as we go through the tutorial, but you can just run this: it looks at the dataset, filters it, and returns the result into your session. (This cell is also not printing inline — I think it's because the cell wasn't runnable before and it retained that setting — so I'll keep it not runnable for now and just explain the concept.)

The best way to think about how disk.frame works is that when you run that dplyr pipeline, it translates it into something roughly like this — maybe clearer if I show it over here. Note the double equals for checking equality. What it actually does is read all the chunk files on disk, and if you're familiar with future.apply: future_lapply differs from a normal lapply in that it runs over the files in parallel rather than sequentially — that's the only difference. So it reads the fst files in parallel, and for each chunk it applies the filter and returns the result. (Strictly speaking that's not 100% what happens internally, but it's close enough as a mental model.) And then what does it do with the filtered chunks? It just combines them together, and there are four equivalent ways to do that, so everyone is covered: if you like base R, it's rbind; if you prefer data.table, it's rbindlist; if you prefer dplyr, it's bind_rows; and there's a purrr equivalent that does exactly the same thing.

So the mental model for disk.frame is really this: when you run a dplyr pipeline over a disk.frame, it applies the same operation to every chunk in parallel and then combines the chunks for you. That's what disk.frame does — it translates the pipeline into that pattern. Obviously you'd much rather write the dplyr version than the explicit version, but in a nutshell that's what happens: run everything in parallel, then combine the results. A rough sketch of that translation follows.
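A rough sketch of that mental model, assuming the flights disk.frame lives in flights_df_path as above; this illustrates the idea, not what disk.frame literally runs internally, and the month filter is just an example operation.

```r
library(future.apply)

# List the chunk files that make up the disk.frame.
chunk_files <- list.files(flights_df_path, pattern = "\\.fst$", full.names = TRUE)

# Read and filter every chunk in parallel (future_lapply), then combine the pieces.
# Whether this actually runs in parallel depends on the future plan,
# e.g. what setup_disk.frame() configured.
filtered_chunks <- future_lapply(chunk_files, function(f) {
  chunk <- fst::read_fst(f)          # each chunk is an ordinary data.frame
  chunk[chunk$month == 1, ]          # the per-chunk operation, e.g. a filter
})

# Any of these would do for combining the per-chunk results:
result <- data.table::rbindlist(filtered_chunks)     # data.table
# result <- do.call(rbind, filtered_chunks)           # base R
# result <- dplyr::bind_rows(filtered_chunks)         # dplyr
```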
As I mentioned, disk.frame just uses future in the background, so if you're curious about the number of workers, you can check future::nbrOfWorkers(), and any option that affects future will also affect disk.frame; on my computer that reports six workers. If you want to change the number of workers, just set it to whatever number you want. Some people have told me that setup_disk.frame isn't ideal if you run it on a server with 100 cores, because it will start 100 workers; on a shared server, if you don't want to hog all the resources, you might want to be prudent and ask for eight instead of 100. But normally I just run setup_disk.frame() and it brings up as many workers as there are CPU cores — six in my case.

Okay, so that's roughly how the thing works. Now, great, you have a data frame and you can convert it to a disk.frame, but normally if the dataset is large enough, someone will hand you the data as a CSV, some other file format, or maybe a database. In section three, which is the next notebook, I'll talk about data ingestion techniques in more detail, but the most common case is that someone gives you a really, really large CSV. I have examples where I read something like a 20 GB CSV with disk.frame, no problem, and I'll go through some of the tricks disk.frame uses to let you read really large CSVs.

Roughly how it works: I take the flights dataset and write it out to a CSV location — that's where I keep the CSV — and I define where I want my disk.frame to sit, again in a temporary directory. disk.frame provides a function called csv_to_disk.frame, and it's pretty obvious how to use it: you tell it where the CSV is and where you want to store the disk.frame, and I set overwrite = TRUE because I've done this before and want to overwrite the previous result. (My inline-output setting is still off, so I may need to restart later, but if you run it, it reads the CSV and tells you where the disk.frame is, exactly as before — nothing new there.)

I also want to mention a function called zip_to_disk.frame: if you have a lot of CSV files zipped inside a zip file, you can call this function and it will unzip the file and convert every CSV in there to a disk.frame. I won't cover it here, but have a look at its documentation; it's fairly obvious how to use as well.

We'll talk about ingesting data a lot more later on, but roughly, you just call csv_to_disk.frame and it reads the file for you. The other thing that can happen is that your data is really large — so large that you want to read the CSV a little bit at a time, because reading it all at once would max out your memory. For that you can set the in_chunk_size argument; in this case it reads the CSV 100,000 lines at a time.

One clarification about the printed output: printing the flights disk.frame just describes it, it doesn't print the content at all — that's the correct output as intended right now (I might add a feature that prints a little glimpse of the data). If you want to see some of the content, you can print the head, or do a tail; both of those functions work. A minimal sketch of this CSV ingestion follows.
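A minimal sketch of that CSV round trip; the file and directory names (flights.csv, flights_csv_df) are purely illustrative.

```r
library(disk.frame)
library(nycflights13)

# Write the flights data out as a CSV, pretending it arrived that way.
csv_path <- file.path(tempdir(), "flights.csv")
data.table::fwrite(flights, csv_path)

# Read the CSV back in as a disk.frame; in_chunk_size reads 100,000 rows at a time
# so the whole file never has to sit in memory at once.
flights_csv.df <- csv_to_disk.frame(
  csv_path,
  outdir        = file.path(tempdir(), "flights_csv_df"),
  in_chunk_size = 100000,
  overwrite     = TRUE
)

head(flights_csv.df)  # printing the disk.frame only describes it; head() shows rows
```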
But currently it doesn't automatically print a glimpse of the data frame, which it probably should; I just haven't put that feature in yet. So this is the correct output: it tells you what the disk.frame is and shows nothing else.

Okay, so as I mentioned, if your data is huge you can think about setting the in_chunk_size parameter, but what actually happens behind the scenes is that disk.frame tries to be smart: it looks at your dataset, and if it thinks the data is too big, it guesstimates a chunk size and tries that number. So most of the time you can just run csv_to_disk.frame without specifying a chunk size, and if that fails, go back and set the chunk size yourself. From my limited testing — limited because I've only tried a few big datasets — it works fine without it, because disk.frame estimates the chunk size for you, so you usually don't have to do that.

Okay, let me make this a bit smaller and go through it on this side. Now I'm up to running the dplyr verbs on a disk.frame, which is the corresponding section here. If you scan through this code, you'll see I'm taking a disk.frame, filtering it, selecting columns, mutating it, and once I'm happy with it I call collect, which collects the data into memory. Of course, if your data is huge you probably don't want to collect — I'll show how to deal with larger datasets later on — but this is for demonstration purposes for now. Once you collect it back into memory it becomes a data frame, so you can use whatever dplyr you want on it. That's pretty normal stuff — dplyr, then collect — which you've probably seen if you've used something like Spark.

But what's this curious thing at the top? It reads srckeep, with columns like month, day, carrier, dep_delay, air time and distance. What srckeep says is: the rest of this pipeline only makes use of these columns, so why not load only these columns into memory? The common question I get is: how is srckeep different from select? The difference is that srckeep only ever loads those columns from disk, whereas select loads every column into memory and then filters it down to the selected ones. So where possible, use srckeep as the first statement to tell disk.frame: hey, don't load every column into memory, only load these. This way you can cut down massively on your memory usage, and it's one of the key features enabled by fst, because fst lets you load only the columns you want into memory. If you do that, your program is going to be a lot faster.

A potential piece of future work is to make disk.frame analyze the pipeline, figure out exactly which columns are used, and do this implicitly, so you don't have to do it manually; for now I haven't implemented that. So my advice is: if a program takes a long time to run, think about using srckeep to load only the columns you need into memory. Apart from that, everything else should be familiar — you can just use a lot of the dplyr verbs; a minimal sketch of such a pipeline follows.
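A minimal sketch of such a pipeline with srckeep, assuming the flights.df disk.frame from earlier; the particular columns and filter are illustrative only.

```r
library(dplyr)

result <- flights.df %>%
  # Tell disk.frame to load only these columns from disk (cheaper than select()).
  srckeep(c("month", "day", "carrier", "dep_delay", "distance")) %>%
  filter(month == 1) %>%
  mutate(early = dep_delay < 0) %>%
  collect()   # only here is the work actually carried out and brought into memory

head(result)
```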
And one thing I should explain now is the concept of laziness. You'll see I have broken the pipeline into one part, then a collect, then something else. disk.frame operations are lazy by default, which means that when I just run the first part, it hasn't actually done anything except record, in memory, the fact that you want to run those operations. Let me show you what I mean. What I'm doing here (let me resize this so the indentation isn't too annoying) is timing the first bit of the code, before any collect. If I run that, you should see that it takes essentially no time at all — but that's because it hasn't done any computation; all it's done is take those operations and store them as instructions.

I'll show you the internals of this briefly: if I look at the object's attributes, there's one holding the lazy functions (lazyfn), and it stores all of this information. What that structure says is: I'm going to remember what you want to do, but I haven't carried it out yet; I'll carry it out when you actually call collect. That's why the first part takes no time at all — it just stores the instructions in this lazy-functions attribute. So what happens when you call collect? It still doesn't take that much time here, but you can see it takes longer, because collect actually carries the operations out; that's where the computation happens. And as I mentioned, once you call collect, the result is put back into memory. (Of course, if your dataset can fit into memory, there's no reason to use disk.frame at all; all of this is just to illustrate roughly what's happening: disk.frame tries to be lazy, it stores the instructions in some form, and only when you tell it to does it carry them out.)

And this is the list of the dplyr verbs that I've implemented — for the full, up-to-date list, check the diskframe.com website, and I'll try to answer the questions I see in the chat at the end. All of these are supported. You'll see that some of them have a chunk_ prefix, like chunk_arrange. Normally in dplyr, arrange sorts the whole dataset, so I made a chunk version that only arranges within each chunk. Actually, this list isn't complete: someone has since contributed an arrange function, so besides chunk_arrange there's also a proper arrange.
So for the updated list of the dplyr verbs, check the diskframe.com website. You can do chunk_group_by and chunk_summarize, same idea: the group-by and summarize happen within each chunk. But of course I've also implemented group_by and summarize, and group_by and summarize without the chunk_ prefix means doing the group-by over the whole disk.frame, exactly as you would on a data frame. There's also transmute and a bunch of joins; for full_join I'd say be careful, because the algorithm behind it makes it a little bit slow — you'll see the reasons shortly when we get to the joins.

Very quickly on group_by: as I mentioned, it works just as in dplyr, almost no change. The exception is that only certain summarization functions are implemented; in this case n and mean are supported, but not every single function, because when the data is chunked it's actually a lot harder to do a group-by correctly, so a lot of functions aren't implemented, and some are only done approximately — things like median, for example. We'll talk a little bit about that later on.

If I want to hammer home the number one takeaway for now: the most annoying requirement in disk.frame right now is that if you want your program to be fast, you have to tell it which columns to load using this srckeep function. That's one of the annoying things I hope to get rid of in future releases. But that's the number one takeaway: if your program is slow, keep only the columns you need. It doesn't really show on this small dataset, but normally, for large datasets — I'll show an example on a 1.8-billion-row dataset — the whole analysis can take 30 minutes without this, and only one minute with it. It's a huge difference.

And as I mentioned, you can do joins. Does anyone notice which join is missing here? We've got left join, inner join, semi join — I don't think I've implemented anti join, actually I'd have to check — and of course the join that's missing is the right join. In disk.frame, when you do a join, the left table always has to be the disk.frame, so if you want a right join, just convert it to a left join, because we simply don't allow it.

Okay, so here are some examples running through the joins — if you're following along in the notebook, have a crack at running them. Roughly speaking, I take the airlines dataset and convert it to a data.table. (I don't actually have to convert it; I can keep it as a data frame and it works the same — I just want to show that it doesn't matter whether it's a data.table or a data frame.) And this, if you remember, is a disk.frame. You can left join a disk.frame to a data frame, no problem: if the right-hand-side table is a data frame, you can do whatever you want, and it joins normally, just as you would expect. Now if I do something slightly different and convert the airlines table to a disk.frame as well, I can still do the join, no problem; it gives you exactly the same results, except of course now you should see that it takes a bit longer.
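A minimal sketch of those two cases — a disk.frame joined to a data frame, and a disk.frame joined to another disk.frame; the output directory is illustrative.

```r
library(dplyr)
library(nycflights13)

# Case 1: the right-hand side is an ordinary data frame -- joins work as usual.
flights.df %>%
  left_join(airlines, by = "carrier") %>%
  collect() %>%
  head()

# Case 2: the right-hand side is itself a disk.frame -- still works, but is slower
# and emits a warning about merge_by_chunk_id (explained next).
airlines.df <- as.disk.frame(airlines,
                             outdir = file.path(tempdir(), "airlines_df"),
                             overwrite = TRUE)

flights.df %>%
  left_join(airlines.df, by = "carrier") %>%
  collect() %>%
  head()
```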
It also gives you this warning, which I want to explain so that you know what's happening when you use disk.frame to do joins. The warning says: merge_by_chunk_id is FALSE; this will take significantly longer, and the preparation needed is performed eagerly, which may lead to poor performance; consider making y a data frame, or setting merge_by_chunk_id = TRUE. (y here is always the right-hand-side table — if you look at the documentation it talks about left-joining x and y, so y is the right-hand side.)

But what does merge_by_chunk_id actually mean? When you merge two disk.frames together, by default merge_by_chunk_id is FALSE, and this is what happens behind the scenes, shown on the left-hand side: because each disk.frame is made up of chunks, to do the join correctly it has to compare each chunk in disk.frame one with each chunk in disk.frame two. As you can see, that's a lot of comparisons, so it's a lot more computationally intensive — but if you don't compare every chunk with every chunk, you can't do the join correctly. If instead you set merge_by_chunk_id = TRUE (the slide says FALSE here, that should read TRUE; this side is the TRUE case), it only matches the first chunk with the first chunk, the second with the second, the third with the third, and so on — a lot fewer comparisons, a lot faster. That's what the warning is referring to. So when you merge by carrier, ideally you'd like to set merge_by_chunk_id = TRUE and get fewer comparisons.

But the question is: if you do it chunk by chunk, how do you know the merge is correct? For example, what if carrier A is in chunk one of disk.frame one but in chunk two of disk.frame two? Then the naive chunk-by-chunk merge doesn't match those rows correctly. And that's what I try to address with the shard function. Sharding is probably one of the more advanced concepts that you won't use that often, but it helps when you want to do things like group-by and joining, so I'll spend a little time explaining it. (I see we're about 45 minutes in, so I'll try to speed things up a little.)

Okay, so how does it work — what does sharding mean? If you shard a disk.frame by columns one and two, then all the rows with the same values in columns one and two will end up in the same chunk. For example, if I take the flights disk.frame and shard it by carrier, then carrier A will always end up in the same chunk. I probably can't tell you whether it's chunk one, chunk two, or chunk three, but it's guaranteed to always be the same chunk. What I should also do, just to be extra safe, is set nchunks to some fixed number, say six: because I want to merge by chunk ID, I don't want the two disk.frames to have different numbers of chunks, otherwise I have a mismatch.
So what you should really do, as in the example here, is set the number of chunks to be equal across the two disk.frames. If you do that, it will always be true that carrier A in this disk.frame — the sharded version — ends up in the same chunk number as in the other one. That's what sharding does: for every value of carrier it applies a hash function, and the hash function is deterministic, meaning it always gives the same number for the same value. The hash output is an integer, and it takes that integer modulo the number of chunks to pick the chunk. So given the same carrier, the rows always end up in the same chunk, provided both disk.frames share the same number of chunks.

And so, if you shard the two disk.frames beforehand, you can do the left join with merge_by_chunk_id = TRUE, and that speeds things up a lot. In this case, if I shard them beforehand and run the join, it only takes two seconds; if I don't, and just left join the two disk.frames directly, it takes four seconds — more than double. That's not such a big problem for this flights dataset, because it's only around 340,000 rows, but imagine you're merging datasets with hundreds of millions of rows: you really want to think about how to structure the data, and you probably want to shard it. One of the best times to shard is when you're reading the data in, so you don't have to do it later, because doing it later is quite expensive.

There are a few ways to shard a disk.frame — you can use shard, and there's rechunk. Typically, when you shard, the number of chunks doesn't change; it uses whatever chunk count the disk.frame already had, so if you start with six you end up with six. But of course you can set nchunks to change that.

Of course, the next question is: isn't sharding itself expensive? Yes — once the data is sharded, the merges are a lot quicker, but the sharding step is expensive too. That's why the advice I typically give is: when you build a program with disk.frame, think ahead about how you want to shard your data. Typically it's really simple, like "I want to shard my data by customer ID", so that all the rows for a given customer end up in the same chunk. It takes a little planning, but it will speed up all your group-bys and all your joins if you apply the shard concept properly. So that's another performance tip: plan ahead and think about how you want your data to be distributed — typically by something simple like a customer ID or an account ID — and do it when you're reading the data in. Try not to do it afterwards, because that literally reorganizes the dataset on your hard drive: if you have a 100 GB file, that's 100 GB of data moving around, which is to be avoided if it can be. It's exactly the same reason re-indexing or repartitioning a database takes a long time — it's doing a lot of data movement. Same with any other data system: plan ahead a little, and otherwise you can do whatever you want. A small sketch of sharding and a chunk-wise join follows.
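A small sketch of that idea, assuming flights.df and the airlines.df disk.frame from above; the shard column, chunk counts, and output directories are illustrative.

```r
# Shard both tables by the join key, with the same number of chunks,
# so matching carriers are guaranteed to land in matching chunk numbers.
flights_sharded <- shard(flights.df,
                         shardby = "carrier", nchunks = 6,
                         outdir = file.path(tempdir(), "flights_by_carrier"),
                         overwrite = TRUE)

airlines_sharded <- shard(airlines.df,
                          shardby = "carrier", nchunks = 6,
                          outdir = file.path(tempdir(), "airlines_by_carrier"),
                          overwrite = TRUE)

# Now the join only needs to compare chunk 1 with chunk 1, chunk 2 with chunk 2, ...
joined <- left_join(flights_sharded, airlines_sharded,
                    by = "carrier", merge_by_chunk_id = TRUE)
head(collect(joined))
```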
So that was sharding and joins; now, grouping and arbitrary window functions. As I mentioned before, you can do a group-by on a disk.frame no problem, but currently there's a fixed list of group-by summarization functions that are supported, so it doesn't support everything. I have an article on diskframe.com that shows how to add more if you want, but out of the box the list is: min, max, mean, sum, length, n_distinct, standard deviation, variance, and any/all for booleans. These are what I'd call exact. Anything involving ranks — like median, quantile, or the inter-quartile range — is an estimate, because it's actually really hard to get an exact median when your data is spread out across multiple chunks. I get asked this question so many times that I actually went and researched it: disk.frame can only give you estimates, and it looks like Spark is also only able to give you estimates, probably for exactly the same reason — the data is sharded into smaller partitions or chunks, so it's not able to give you an exact median. I've done some testing, and unless my dataset is really strange, I find it difficult to come up with a dataset where sharding it and estimating the median gives a result that's very far off. It's almost hard to construct an example where it doesn't work — but I want to be 100% clear about the limitation: all of these are estimates only, they're not exact.

Besides group_by we also provide two other group-by functions which I won't talk about too much — you can go to diskframe.com and read some of the examples — but I'll briefly describe them. group_by works exactly as it would on a data frame: if you do group_by followed by summarize(mean(x)), it actually computes the mean of x correctly for each group, even though the rows for x could be sitting in different chunks. If you only want the group-by done within each chunk, you use chunk_group_by. And if some operation isn't in the supported list, and you really need the data grouped by something before you apply it, you use hard_group_by. The "hard" means that whatever columns you're grouping by, it will reorganize the dataset — basically sharding it in the background by the group-by columns. So in this example, if I do a hard_group_by on year, month and day, it will reshard the dataset using year, month and day as the sharding columns. For some algorithms you really do want that, but think really hard before you apply it — I put the "hard" in front partly to warn people that it takes a long time. If you stick to group_by and the supported summary functions, you're probably fine and the performance is probably going to be okay. I won't talk more about hard_group_by or chunk_group_by, but feel free to ask questions on GitHub. A minimal sketch of a group-by follows.
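A minimal sketch of a whole-disk.frame group-by using only supported summarization functions; the grouping column and summaries are illustrative.

```r
library(dplyr)

flights.df %>%
  srckeep(c("carrier", "distance")) %>%   # only load the columns we need
  group_by(carrier) %>%
  summarize(
    n_flights     = n(),             # exact
    mean_distance = mean(distance)   # exact; means are combined correctly across chunks
  ) %>%
  collect()
```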
Okay, so to round up the basics: a really common question I hear is — okay, cool, you have the dplyr verbs, but what if I want to do something slightly different, something a bit arbitrary? This is where something like disk.frame is a lot better than sparklyr or some other system where you have to talk to a database, because one of the key selling points of disk.frame is that you can apply arbitrary R functions to the chunks. I provide two main functions for that: one is called delayed and one is called cmap, and they're basically the same thing; cmap is more general, and delayed is basically cmap with lazy set to TRUE, so delayed is always lazy.

For example, I take my disk.frame and apply the nrow function to count the number of rows in each chunk. What delayed does is take a disk.frame and apply this function to every chunk. Some of you may recognize that this is the purrr syntax: the ~ defines a function whose first argument is always .x, the second is .y, et cetera. So here I'm just applying nrow. And if I want my results returned as a list, I call collect_list instead of collect: say I have six chunks, and I collect_list the nrows, it tells me how many rows there are in each chunk.

If you don't want delayed, you can use cmap — the c in cmap stands for chunk, and it applies the same function to each chunk. If I set lazy = FALSE, it runs immediately; by default lazy is TRUE, so it doesn't run anything until you call collect — same concept. Now a slightly more complicated example: actually, this code isn't quite right — if you run map, it gives you a warning saying map is deprecated, please use cmap. I found that map was a bit confusing; cmap emphasizes the fact that you're mapping over each chunk, so I deprecated map. But it's the same concept: cmap over the disk.frame, and the arbitrary function I want to apply returns the first 10 rows of each chunk. I set lazy = FALSE and give it an output location, and it just works.

So cmap and delayed allow the user to specify an arbitrary function to apply to each chunk, and disk.frame applies it. We'll see a few more examples in section four when we talk about advanced usage, but roughly, that's what you can do: for example, within each chunk I can fit a logistic regression and then use broom to turn the coefficients into a table and return that. You can do arbitrary things, whatever you can think of, with any R function from any R package you come across; you're not limited to whatever verbs some other system provides. If you're working against a database and the function you want — say some text-mining function — isn't available in the database, you can still do it in disk.frame, because you can apply any R function you want, and it will just work. A small sketch of cmap and delayed follows.
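A small sketch of cmap and delayed on the flights disk.frame; the functions applied here (nrow, head, and a per-chunk linear model tidied with broom) are illustrative.

```r
library(purrr)   # for the ~ .x lambda syntax
library(broom)

# Count rows per chunk, lazily, and bring the results back as a list.
flights.df %>%
  delayed(~ nrow(.x)) %>%
  collect_list()

# cmap applies an arbitrary function to each chunk (lazy by default,
# so nothing runs until collect); here, the first 10 rows of every chunk.
flights.df %>%
  cmap(~ head(.x, 10)) %>%
  collect()

# A more interesting per-chunk computation: fit a small model in each chunk
# and tidy the coefficients into a data frame with broom.
flights.df %>%
  cmap(function(chunk) {
    fit <- lm(dep_delay ~ distance, data = chunk)   # illustrative model
    tidy(fit)
  }) %>%
  collect()
```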
And to round out the tutorial for this section: I've also implemented sampling, so you can sample_frac from a disk.frame if you want, and lastly you can take a disk.frame and write it out somewhere else with write_disk.frame. Of course, you wouldn't just take the original disk.frame and write it straight back out; you'd typically take a disk.frame, do something to it like a filter, maybe a group-by or whatever else, and then write the result out somewhere. I have more examples of that in section four.

So hopefully that gives you a good overview of the functionality of disk.frame. Checking my timer, we're just past the one-hour mark; I'll continue for another 40 minutes going through sections three and four, and then leave 20 minutes at the end for questions. So far, we've covered: disk.frame lets you apply the dplyr verbs for data manipulation, and I've shown the list of supported verbs — please check the references section on diskframe.com to see if your favourite verb is there, and if it's not, let me know. We talked about joins, and how the sharding concept can help with things like joins and grouping. We talked about how group-by and summarization are possible, and there's a lot more on that in section four. And we've implemented functions like sample_frac and write_disk.frame, plus a bunch more that I won't have time to cover — please do look through them if you want to learn more about disk.frame.

Okay, so let me take a sip of water and move on to number three, ingesting data — and let me enlarge that. I see a comment saying cmap is great — thanks. That's one of the reasons some people might prefer disk.frame to a database or Spark: in those other systems you can't really apply all the R functions available in the R ecosystem unless you take a performance penalty. You can, of course, load some data from the database into R, apply the R function, transform the result back into the format the database wants, and push it back in — but that translation from the database to a data frame and from a data frame back to the database is typically very time-consuming, because converting datasets between R and a database format tends to be very inefficient. Recently there are things like Arrow that are meant to help with that, but I guess we'll have to wait and see how effective they are. The good thing about disk.frame is that there's no translation: as I showed before, all it's doing is loading fst files into data frames, which you manipulate using normal R code, and if you want, you save them back to fst. At no point do you have to convert between R and some other system — Arrow doesn't even need to get involved, because it's basically R to R. (And to answer another question: yes, you can use cmap without the purrr syntax; I believe there are examples in section four where the function is defined as per normal.)

Okay. So I've given you an overview of how disk.frame works, and now I want to go over what I think is one of the biggest obstacles to using any new data system, which is: how do you even get your data into that format? And disk.frame is no different.
What I've tried to do is provide as many examples and convenience functions as I can. Some of them aren't even in the disk.frame repo — for example, I have a very fast SAS importer that converts SAS datasets to the disk.frame format, and that's not in the disk.frame repo — and in this notebook I'll show another function that's not in the repo, for converting databases to disk.frames. But let me start with ingesting data. As I mentioned before (and I keep showing this), you can use csv_to_disk.frame. Actually, "CSV" might confuse some people: to me, if you have pipe-separated data files (PSV) or tab-separated files (TSV), I call all of those CSVs. I know that's technically not correct, because CSV stands for comma-separated, but anything pipe-separated or tab-separated can be thrown into this function and it will just work; I couldn't be bothered creating separate PSV and TSV functions just for that.

And actually, disk.frame doesn't provide its own CSV reader; I just reuse the readers that already exist, including fread from data.table, which is the fastest one out there, and the readr reader. That's what the dot-dot-dot — the ellipsis — in csv_to_disk.frame is for: after the usual suspects, the input file and the output directory, anything you put there gets passed through to the underlying readers — fread from data.table, readr, or LaF — so you can make use of all their functionality if you know what you're doing.

Okay, so first I load data.table, write the flights dataset out to a CSV, and read it back as before — that's just what I've shown already — and once it's loaded back, you can read the dataset. Now, as I mentioned, there are multiple backends being used in disk.frame, and occasionally people come to the forum and say, your CSV reading function doesn't do this, or it's got that issue, why don't you look at it? And I go, okay, cool, this sounds like a beginner-friendly issue, perhaps you could look at it and try to submit a PR — and then they look at the CSV code and find it too complicated. So I admit I probably need to do some refactoring of the CSV-reading code; even I find it hard to follow my own code there, but it kind of just works. Still, I want to describe the logic at a high level: it tries to do a lot of magic and the logic is quite convoluted, so I'm going to refactor it at some point.

Okay, so this is what happens when you ask disk.frame to read a CSV with csv_to_disk.frame. It looks at the file. If it thinks the dataset will fit into RAM, no problem, it just reads the whole CSV in one go and that's it. If it thinks the file is too large — say you're trying to read a 100 GB CSV, which disk.frame will probably handle fine — then it uses the bigreadr package to split the file into however many smaller files, and reads those smaller files simultaneously using multiple workers. I've found that to be the fastest approach. Still, a few problems can arise.
My point is that all of this is quite involved, so you should read the documentation if you're interested. People who've used my package have said it's really well documented — I don't know if that's true, but I've tried to document it as well as I can — so if you spend time reading the documentation, you should find something useful in there. There are quite a few options, more than I can go through here, and there's an ingesting-data article on diskframe.com that you can read too. What I described is just a high-level summary of the logic; the real logic is a lot more complicated, with about ten different things going on, so really just try it out, and if you run into bugs, please report them. Otherwise, have a look at the documented options to see if one of them suits your needs, and read the ingesting-data article to see what can happen.

Now, here's another thing that makes the code really convoluted: multiple input files. For example, I take the same flights dataset and write it out as files one and two. The key point isn't that the two datasets happen to be the same, but that I have multiple CSV files with roughly similar columns. In that case, instead of passing the path to one CSV as the first argument, you can pass the paths to multiple CSVs, and disk.frame will try to read them all and concatenate them into a single disk.frame for you. And this is what you see printed on screen when you run it: a bunch of messages saying, you want to read multiple files — please use colClasses to set the column types, to minimize the chance of failure.

What do I mean by that? This is actually one of the major pain points of reading CSVs, because CSVs don't come with types. The most common way I see people fail to load a dataset into a disk.frame is when they have two CSVs that are meant to have the same columns with the same types, but when the CSV reader tries to infer the types, it decides, say, that column a is an integer in the first file, and that column a is a string in the second file because it contains something like "N/A". Then the two CSVs end up with different column types, and disk.frame doesn't let you append chunks with different types to the same disk.frame; it just won't let you.

Let me show a quick example of that happening. Say I create a disk.frame a from the simplest data frame you can think of: a column a equal to 1, 2, 3 — that's it, that's the whole disk.frame. Then I define another disk.frame b, where column a is "a", "b", "c" instead. So a and b have the same column name but different column types. The way to combine two disk.frames into one is rbindlist.disk.frame: I pass it a list of the two disk.frames to row-bind them into one, and it should just fail.
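A small sketch of that failure mode; the column name and values follow the example in the talk, and the exact error message is whatever rbindlist.disk.frame produces when the chunk types don't match.

```r
# Two tiny disk.frames with the same column name but different column types.
a <- as.disk.frame(data.frame(a = 1:3),
                   outdir = file.path(tempdir(), "df_a"), overwrite = TRUE)
b <- as.disk.frame(data.frame(a = c("a", "b", "c")),
                   outdir = file.path(tempdir(), "df_b"), overwrite = TRUE)

# Row-binding them into one disk.frame should fail, because one column
# is integer and the other is character.
rbindlist.disk.frame(list(a, b))
```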
Let me show a quick example of that happening. Say I have a disk.frame a, the simplest one you can think of: as.disk.frame of a data.frame with a column a equal to one, two, three. That's it; head of a, collect of a, that's all there is to it. Now I define another disk.frame, b, where the column a is instead "a", "b", "c". So a and b have the same column name but different column types. The way to combine two disk.frames into one is rbindlist.disk.frame: I pass it a list of a and b to row-bind them into one, and it should just fail. It can't combine the two, because one column is integer and the other is character. This is a very common thing, especially when your CSVs are spread out across multiple files, and that's why recommendation number one is to set colClasses: tell disk.frame what the columns are, otherwise the underlying CSV readers may infer different column types for different files. That is the number one cause of failures; the danger with reading CSVs is always that the column types end up mismatched across files. And the way to solve it is to set the column types using the colClasses parameter. The output will also try to explain to you exactly what it's doing: it works in two stages. Stage one converts the two CSVs into, say, six chunks each; stage two row-binds them together; and it tells you how long each stage took. So there's an algorithm behind loading multiple CSVs, and it can get convoluted, so do refer to the help, the online article, or ask questions on GitHub if you run into issues. Otherwise, it's as intuitive as it looks: just pass multiple CSV paths. You obviously have to make sure the files have the same columns and roughly the same column types, or disk.frame will refuse to combine them for you. Okay, so that's about reading CSVs.
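Here is that little failure demo as a runnable sketch: two tiny disk.frames whose only column shares a name but not a type.

    library(disk.frame)

    a <- as.disk.frame(data.frame(a = 1:3))               # integer column
    b <- as.disk.frame(data.frame(a = c("a", "b", "c")))  # character column

    # Expect an error: disk.frame refuses to row-bind chunks with mismatched types.
    ab <- rbindlist.disk.frame(list(a, b))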
So what if the data you need to ingest isn't a CSV at all? That can get a little tricky, and I may have to develop more readers, but the function I recommend everyone get familiar with for custom ingestion is add_chunk. add_chunk is really simple: you take an existing disk.frame, you give it a data.frame, and that data.frame is added as a new chunk. If you already have chunks one, two, three, calling add_chunk adds chunk four. You can be even more cute than that: a disk.frame is just a folder of FST files numbered one to n, so you could write a program that produces those FST files simultaneously, drop them in a folder, and that would be a valid disk.frame, as long as the columns and column types are the same across the files. For this session, though, I'll focus on using add_chunk to build up a disk.frame chunk by chunk. This is a bit of a contrived example; I want to demonstrate how to use add_chunk, not necessarily how you'd want to do this for real. What I do is set up a SQLite database, very simple stuff: I connect to it and create a table called flights, containing our friend the flights dataset. The thing I want to demonstrate is the SELECT ... LIMIT ... OFFSET syntax: you select everything but limit how many rows come back, and provide an offset. The offset just says skip the first n rows before you start reading, so with offset zero and a limit of 100 you read the first 100 rows. If you think about how to use that, you can build up a little function (I'll show a sketch of the whole loop in a moment). First figure out how many rows there are with a SELECT COUNT, pull the row count out of the returned data.frame as an integer, and then compute a sequence of offsets: the first offset is zero, and if I read 50,000 rows at a time, the next offset is 50,000, and so on. Then it's basically a loop (written here with purrr syntax) over the offsets, building up a SELECT statement each time: select from the table, order by the row number so the paging is deterministic, limit by the chunk size, and apply the current offset. Obviously this is a bit inefficient, and against a proper database you'd probably do something smarter, but the idea is this: first pass, offset zero, return 50,000 rows; next pass, offset 50,000, return 50,000 rows; then offset 100,000, and so on. Every time a new chunk comes back, I just add_chunk it onto my disk.frame. And that's all: you can read data from a much larger database and put it into a disk.frame. At some point I should probably make this into a proper function provided by disk.frame, but for now you get the idea, and as long as you have a mechanism to read from some other source, a Parquet file or whatever format, you can build up a disk.frame using add_chunk. add_chunk also has a chunk_id parameter, which defaults to NULL. If you don't set it, it just counts how many chunks the disk.frame already has and adds one: with three chunks, the next call writes 4.fst. And if you want to be cute and manage chunk_id yourself, setting it from one to n in a parallelized process, you can actually build up a disk.frame with add_chunk in parallel, as long as you keep track of the chunk IDs yourself rather than relying on disk.frame to assign them. I hope the idea is clear: I can call this function and build up a disk.frame myself, just like this. So, if you're reading a CSV, TSV, PSV, whatever it may be, call csv_to_disk.frame, and you can even pass multiple files. If you have really customized needs, think about using add_chunk to build the disk.frame up; with add_chunk you can talk to arbitrary systems and assemble it yourself.
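Here is a hedged sketch of that paging loop, assuming DBI and RSQLite are available; the table, column, and path names are illustrative, and in practice disk.frame's own readers or a purpose-built function would be preferable:

    # Pull a table down page by page with LIMIT/OFFSET and append each page
    # to a disk.frame with add_chunk().
    library(disk.frame)
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "flights.sqlite")
    dbWriteTable(con, "flights", as.data.frame(nycflights13::flights), overwrite = TRUE)

    n_rows     <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM flights")$n
    chunk_size <- 50000
    offsets    <- seq(0, n_rows - 1, by = chunk_size)

    # Seed the disk.frame with the first page, then append the rest.
    first  <- dbGetQuery(con, sprintf("SELECT * FROM flights LIMIT %d OFFSET 0", chunk_size))
    out.df <- as.disk.frame(first, outdir = "flights_from_db.df", nchunks = 1, overwrite = TRUE)

    for (off in offsets[-1]) {
      page <- dbGetQuery(con, sprintf("SELECT * FROM flights LIMIT %d OFFSET %d", chunk_size, off))
      add_chunk(out.df, page)               # each page becomes the next chunk on disk
    }
    dbDisconnect(con)
    nchunks(out.df)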
So that's the ingesting-data section I wanted to cover. I only have 20 minutes left, so I'll go through the "beyond basics" part quickly, but everything in it is something you've already seen in the overview session: group_by, cmap, and lazy evaluation. Nothing new here, just a bit more depth. For group_by, there's the usual boilerplate: I set up a dataset. This is probably the one part you can't run live yourself, because the data sits on my local drive. It's the Fannie Mae dataset; I may even have helped popularize it. It's a great dataset, and RAPIDS AI have very kindly made it available for testing their GPU system. If you Google "Fannie Mae RAPIDS AI" you'll find a page where the dataset is available in a very easy-to-download format; if you want to test with large datasets, definitely go there, you can literally just download the whole thing. On the official Fannie Mae website you have to sign up and jump through hoops to get the data, but this way you just download it. I originally grabbed it to test their RAPIDS GPU data frame system, which is pretty interesting, but it's simply a really good source of large test data. Anyway, you can see here I've constructed a disk.frame with about 1.8 billion rows, around 31 columns, in 168 chunks. All the normal stuff works, you can call head on it. Now I want to show a group_by: group by the month, then compute a couple of summaries, the mean of one column and the median of another. When I run this, it takes no time, about a second, because, as I mentioned, it's only storing the instructions; nothing has actually run yet. And this is where srckeep is so key. Because I'm only using these three columns, I call srckeep with them here. If I didn't, this step would be a pain, more like 30 minutes. But keeping only the three columns my analysis uses, collecting this grouped query takes about one to two minutes. So now it's actually taking serious time, but it is summarizing 1.8 billion rows with a group-by on my desktop, which is admittedly reasonably powerful, although I've done the same demo on a much less powerful machine. And I really want you to see this: while the group-by runs, all my CPU cores are being used, and memory doesn't shoot up much. That's why disk.frame is fast: it runs the group-by on the chunks simultaneously and does some smarts behind the scenes to combine the results for you. You're just writing a group-by; the only annoying extra is the srckeep call, and everything else should already be familiar. Give it a couple of minutes and it's done. I just really want to show you, live, how much data it can handle on one computer, without distributing anything, with the results back in about one or two minutes; it reports the elapsed time when it finishes. So that's the whole thing, 1.8 billion rows. It varies a bit, sometimes 60 seconds, sometimes 80. Ah, it's finished running: 86 seconds, roughly. And you can see the results; it's exactly what you'd get from a normal data frame, and I've tested that multiple times.
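The same pattern, sketched on the much smaller flights disk.frame from the earlier sketches so it can be run anywhere (the column names are from nycflights13 and the variable name flights.df is illustrative):

    library(disk.frame)
    library(dplyr)

    result <- flights.df %>%
      srckeep(c("month", "dep_delay", "distance")) %>%  # read only these columns off disk
      group_by(month) %>%
      summarize(
        mean_delay  = mean(dep_delay, na.rm = TRUE),
        median_dist = median(distance, na.rm = TRUE)
      ) %>%
      collect()                                         # nothing runs until here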
They give you the same results. This is just an example of how to summarize a large dataset on one computer without reaching for Spark and the like, and I even included a picture of the CPU usage, with all the cores busy. Let me emphasize one last time: if you want your query to be fast, use srckeep; if you don't, it's going to be slow. How much RAM did it use? That's actually a good question. It used about 25 gigabytes, roughly. And that reminds me of something about memory usage that I haven't put in the talk, so let me quickly explain. You saw this disk.frame has 168 chunks, and, if you recall, I have six workers, so disk.frame loads six of the 168 chunks into memory at once. If your computer is less powerful, you can simply increase the number of chunks, and then you don't use as much memory: double the chunk count to three hundred and something, load six of those at a time, and you'll use a lot less than 25 gigabytes. Chunking is the mechanism by which disk.frame helps you manage the amount of RAM you need. One thing I didn't mention, and another reason the CSV-reading code is so convoluted, is that in there you'll also find a section that says: your CSV is this big, you have this much RAM and this many CPUs, therefore I think you should cut your disk.frame into this many chunks. So it has logic to infer how many chunks to cut the dataset into. For me, with a fairly powerful desktop with 64 GB of RAM, 168 chunks is fine; on a less powerful laptop you might double the number of chunks and still not run out of memory. It will just take a bit longer, because, within reason, fewer and larger chunks are more efficient: machines are very good at loading one big file into memory and running everything from there. Anyway, memory usage is important, and you control it through the number of chunks. And if, say, you've got a disk.frame from a colleague with a more powerful computer and you want to increase the number of chunks, how do you do that? Use the rechunk function. In this case I have the Fannie Mae disk.frame: I can use the nchunks function to check how many chunks it has, and since my computer is less powerful, my new chunk count is nchunks times two. Then I call rechunk on the disk.frame with nchunks set to that new value. With rechunk you can adjust the number of chunks in your disk.frame, which is very useful whenever a disk.frame moves from one computer to another that is less or more powerful. In theory I could also provide a function to guess the optimal number of chunks, but I haven't done that yet, so for now you have to take a stab at it and guess. Okay, so that's roughly how it works.
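As a small sketch of that, again with illustrative names:

    library(disk.frame)

    old_nchunks <- nchunks(flights.df)                            # how many chunks now?
    flights.df  <- rechunk(flights.df, nchunks = old_nchunks * 2) # twice as many, smaller chunks
    nchunks(flights.df)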
Okay, and as I keep emphasizing, do use srckeep as much as you can. I know it's a bit annoying, but to me that's really the only annoying thing about disk.frame at the moment, and I'll try to improve it in the future. As I mentioned, group_by supports all of these functions. Roughly speaking, think about how a group-by has to work when your dataset is chunked: it's a reduction. In this example I'm showing how to do a sum. I create the sum of n grouped by id, but the same id can appear in two different chunks. So the way to do it is: in parallel, you summarize each chunk, summing n by id within that chunk. You then collect the per-chunk results into one data.frame, but even then the result is still incorrect, because a given id may have come from two different chunks, so you have to do another summarize to reduce it into the final answer. disk.frame does all of that for you automatically, but only for these supported functions. If you want to define your own group-by summarization, check out diskframe.com; there's an article there called "custom one-stage group-by function" that explains how. Essentially you provide two functions for each summarization operation, corresponding to the two stages: the first summarizes each chunk in parallel, before the partial results are concatenated; the second takes those concatenated partials and reduces them into the correct final form. One function for the parallel part and one for the reduction: provide those two and you can define arbitrary group-by functions yourself. Now, some limitations on group_by in disk.frame currently. Nested aggregations aren't allowed: something like the mean of one column divided by the max of some other value is fine in normal dplyr, but not here, because the max is nested inside the mean. That's one of those things to improve in the future; for now, when you do a group-by, stick to one level of aggregation and don't nest them, otherwise it isn't going to work. To answer the question from the audience: yes, the two functions are basically a much simplified version of map-reduce. Frameworks like Spark do a shuffle and then let you do the reduce, so it's map-shuffle-reduce or some combination of those. disk.frame doesn't give you a shuffle step, so the workers can't talk to each other, but you can do a plain map-reduce, which is much simpler.
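To make those two stages concrete, here is a hand-rolled sketch of a chunked sum using cmap, assuming the flights.df from the earlier sketches; in practice disk.frame's own group_by/summarize does this for you:

    library(disk.frame)
    library(dplyr)

    # Stage 1: summarize every chunk independently (this part runs in parallel).
    partials <- flights.df %>%
      cmap(~ .x %>% group_by(month) %>% summarise(part_sum = sum(distance))) %>%
      collect()                            # one partial sum per chunk and month

    # Stage 2: reduce the per-chunk partials down to the final answer.
    totals <- partials %>%
      group_by(month) %>%
      summarise(total = sum(part_sum))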
Okay, so lastly, cmap. As I've shown before, with cmap you can apply arbitrary functions to each chunk. Earlier I showed the purrr-style syntax; this example passes a normal function instead. I like to write my function to take one argument called chunk, and in it I just compute the number of rows of the chunk times the number of columns, what I call the number of cells. As usual, calling cmap doesn't actually compute anything; it just reports a lazy result. There are two ways to make it do the computation: I can pass lazy = FALSE, in which case it just returns the results, or I can call collect_list to collect them. So: take a disk.frame, apply an arbitrary function to each chunk, and if what your function returns isn't a data.frame (in this case it returns an integer) you're better off calling collect_list, which returns a list of per-chunk results. If your function returns a data.frame, you can just call collect, and the chunk results are concatenated into one data.frame for you. Another example: compute n_cells as ncol times nrow and return a new data.frame for each chunk; since that returns a data.frame, collect just concatenates them for you. If you're familiar with purrr, cmap will feel familiar, because I took the design straight from purrr. There are related functions I haven't talked about, such as cimap, where the "i" means your function gets a second argument, basically a counter giving the chunk index, and cmap_dfr, where "dfr" means data-frame row-bind; it's exactly like cmap except you're explicit that the results are to be row-bound into a data frame. They're all modeled on purrr, so if you've learned purrr, these functions should be familiar when you shift to disk.frame. Very quickly, to close this off: that's all great, you can cmap and collect, but a lot of the time I don't want to collect my data, because that's the whole point of using disk.frame: my data is huge. I want to apply some operation and save the result somewhere else, or deal with it some other way, not bring it into memory. So here's what you can do. You call cmap; in this case, for each chunk, I do a mutate (of course, since it's only a mutate, I could have used the mutate verb directly outside, but this is just to demonstrate what happens) and I set the outdir argument to where the output should be written; here it's the flights data with a flight date added. You also have to set lazy = FALSE, and, depending on what you want, overwrite = TRUE; I've run this multiple times, so I just overwrite the output. That's how you apply a transformation and save it somewhere else using cmap. Sometimes, though, you want to do something more complicated inside cmap. In this next example I use the purrr syntax, where .x is the first argument, which is just the chunk. For each chunk I fit a GLM model, then use the broom package to turn the fitted model into a tidy table, and this is where disk.frame is more powerful than other systems, because it allows you to use arbitrary R functions, as I mentioned before. Then I rename the p-value and standard-error columns, and that's it. Notice I haven't called collect: the result is just another lazy disk.frame. I can collect on it, and it brings back the parameters from the broom model fits for every chunk; or, instead of collecting, I can write the result somewhere else with the write_disk.frame function, again with overwrite = TRUE, and it writes the new disk.frame out and tells you what it now contains.
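Here's a hedged sketch of that per-chunk modelling idea; the model formula is illustrative and broom is assumed to be installed:

    library(disk.frame)
    library(dplyr)

    # Fit a small glm on every chunk and tidy the result with broom.
    fits.df <- flights.df %>%
      cmap(~ broom::tidy(glm(dep_delay ~ distance, data = .x)))

    collect(fits.df)                                   # one tidy table per chunk, row-bound
    write_disk.frame(fits.df, outdir = "fits.df", overwrite = TRUE)  # or persist it instead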
Or sometimes you don't want to materialize anything yet at all. You apply your mutate, nothing has happened because it's all lazy, and you can simply save the lazy disk.frame as-is to a file using saveRDS. When you load it back, it still works, because disk.frame stores the recorded instructions inside the disk.frame object, so when you save it, the description of what you wanted to do gets saved too. You can save a lazy disk.frame and load it back at any point, and it will just work. There's also an example using the qs package to save it, which is exactly the same thing: you apply some function lazily, save with qs, load it back, and it still behaves like a normal disk.frame. Now, this next one is a bit more subtle. Here I set x equal to 100, and in my disk.frame I compute departure time equal to departure time plus x. Of course, x is that global value of 100, and you can see the computation is done correctly: the value would normally be about 517, and adding 100 gives 617. So that's fine, all good. (By the way, there's a get_chunk function for getting the first chunk, or the nth chunk, just to take a look.) But here's the point: I use a global variable, I save the lazy disk.frame, and then I remove x, so at this point x no longer exists in the global scope. Then I read the saved object back, look at it, and it still does the computation correctly. What disk.frame does is this: every time you run an operation that relies on a global variable, it tries to save a copy of that global, so you don't have to keep the global around the next time you open things up. That's another reason you can just save a lazy disk.frame and expect it to work when you run it later: the globals are captured and reloaded with it. Of course, there are limitations. Things like XGBoost models, which I believe are essentially pointers to C++ objects, won't survive this; those you have to save properly in their own way rather than leaving them in a global. But for simple things like this, disk.frame keeps a copy, so it knows exactly what x was when you ran the operation. The danger is that your globals might contain a huge amount of data; you could have a global data frame with 100 million rows, and disk.frame will try to keep track of it, which is really inefficient. So do watch how many, and how large, the globals you use are. Otherwise, it's generally safe to use global variables in your computation, save the lazy disk.frame, load it back at a later date, and expect things to work.
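A sketch of that save-and-restore behaviour, as described above (variable and file names are illustrative):

    library(disk.frame)
    library(dplyr)

    x <- 100
    lazy.df <- flights.df %>% mutate(dep_time2 = dep_time + x)   # lazy, nothing run yet

    saveRDS(lazy.df, "lazy_flights.rds")     # the recorded steps, and the captured x, go with it
    # qs::qsave(lazy.df, "lazy_flights.qs")  # the qs package works the same way, only faster
    rm(x)

    restored <- readRDS("lazy_flights.rds")
    head(restored)                           # still computes dep_time + 100, even though x is gone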
Okay, we really have to finish now to leave enough time for questions. In the beyond-basics section we talked about group_by and how cmap works; we showed more examples of how disk.frame is lazy by default, how you can use cmap and write_disk.frame, and how you can use saveRDS or qs to save your lazy disk.frames and use them later. So you can build a pipeline that does lots of operations on a disk.frame and produces a prepared dataset at the other end, and you can do that a few ways: completely lazily, saving only the instructions in between and then replaying them, or using write_disk.frame to say, I've applied all of these operations, now save the result somewhere else, and then keep doing the next thing. You have a few options there. Okay, I think it's time for Q&A. I didn't have time to go through everything, so feel free to go to GitHub and raise an issue, or go to diskframe.com and read some of the articles; I'm actually very responsive on GitHub, so if you have any issues, raise one there and I'll answer. Sorry I took a bit longer than planned, but yes, time for questions. So, do you want to answer the last one first, ZJ, and then I'll privately paste all the questions I compiled to you via chat and you can work through those? Okay, sure, I'll work through the questions; let me go from the back. What is my typical use case? For me, one of the most useful things is to take a big dataset and convert it to a disk.frame once. Then I can do a lot of summarization, like group-bys. If someone wants to look at trends over the last ten months, I just filter and then group by, so I almost use it as a reporting tool: group-bys are a lot quicker on one computer. The other thing I do is wrangle, for example, the Fannie Mae dataset into something I can use for modelling: I wrangle the whole dataset and then use sample_frac to sample a small enough fraction that I can work with in other tools. So I see myself building up this huge "database" (I call it a database, but it's just a disk.frame sitting on my disk), and every quarter, when new data arrives, I add_chunk it on, and I can use sample_frac to sample from it and build, say, logistic regression models on the data. Those are my two main uses: one is a fast group-by tool, the other is a base I can sample from. And I'm pretty sure there's an article on diskframe.com about fitting logistic regression with disk.frame; you can do it, with some caveats, but typically I don't fit the regression directly on the disk.frame, because that restricts which packages I can use. I sample and then build my models from that. Next question: different R users have different typical shapes of data, so datasets might have many rows or many columns; how does disk.frame's performance vary with width and depth? As I mentioned, you control a disk.frame through its number of chunks, and this one would have 168 chunks however many rows and columns there are. Because disk.frame uses data.table in the background, on a per-chunk basis it has the same performance characteristics as data.table.
So the additional control disk.frame gives you is the number of chunks; per chunk, there's no difference from data.table. It's a question of how many chunks you have and how efficiently you manage them: whatever data.table's characteristics are, disk.frame inherits them, and on top of that you need to be careful about chunking. That's my answer to it. Okay, yes, sure, please feel free to email the questions to me; I'll open my email on another screen so I can see them all. Let me scroll back and see what's there. I've answered the group-by question already. Next: "When I just want to edit medium data, doing operations on each cell rather than summarizing, how do I make sure the lazy changes are committed before saving the modified dataset? I cannot use collect." So, Raphael, the short answer is: use the write_disk.frame function. Let's look at an example. Remember the disk.frame I called a; in this case it's small enough that you could collect it, but I know the question is how to do this without collecting. Say I want to multiply every value in column a by two, and I want to make sure I've done the right thing before saving it somewhere else. Here's what you can do. Assign a1 <- a with a mutate that multiplies column a by two; a1 is now, conceptually, disk.frame a with that column doubled. How can I check that? There are a few ways. You can call head on a1, and you can see the values have changed from one, two, three to two, four, six. (Just to show you, you don't even need to call collect, though with a dataset this small you could, and you'd see it's done it correctly.) Or you can look at the first chunk with get_chunk, which is usually efficient enough, or the second or third chunk; it just so happens this three-row table was broken into three chunks, but you get the idea. So you can check whether it's done the right thing, and once you've confirmed that via head or get_chunk, you just call write_disk.frame on a1 with an output path, say a2.df. That saves the result, with the correct values, to the new path, and you can check again: head of a2, correct; get_chunk of a2, correct. That's how you can verify the lazy changes and commit them without collecting.
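As a sketch of that checking workflow (names are illustrative):

    library(disk.frame)
    library(dplyr)

    a  <- as.disk.frame(data.frame(a = 1:3))
    a1 <- a %>% mutate(a = a * 2)          # lazy; nothing written yet

    head(a1)                               # peek at the transformed rows
    get_chunk(a1, 1)                       # or inspect a single chunk directly

    write_disk.frame(a1, outdir = "a2.df", overwrite = TRUE)   # happy? commit it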
Okay, I've got the questions Juliana compiled via Gmail, so instead of answering from the bottom up, let me give the people who asked at the beginning a go. "Where would data.table and H2O fit, big or medium data?" data.table is completely in-memory, so given my definition of medium data, anything that doesn't fit into RAM but fits on your disk, data.table technically sits in the small-data paradigm. Of course, you can rent EC2 instances with four terabytes of RAM, and whether four terabytes counts as "small data" is debatable, but my definitions are really about laptops and desktops. I'm not so familiar with H2O, so I don't know what it provides beyond data.table: if it has its own module for manipulating data on disk, then it handles medium data; if it only works on in-memory tables, it's small data. Next question: "Is it possible to run disk.frame on GPUs, like NVIDIA's?" disk.frame doesn't talk to the GPU directly as such, but I believe you can do something like this: take your disk.frame and cmap over it, and for each chunk, convert the chunk into XGBoost's format and train with XGBoost's GPU option turned on. What I've typed here isn't strictly correct syntax, but you get the idea; I believe XGBoost has an option to run its models on the GPU (I could be wrong about that), and nothing stops you doing it from inside cmap, because disk.frame lets you run arbitrary R code. Does disk.frame use the GPU directly? No, you have to use it through some other package. "Is future.apply like foreach or parallel apply?" Yes, except I much prefer future, because future detects all your global variables for you: you don't have to say, pass this global to the worker session, load this package there; it detects them. A long time ago, when I used foreach, you had to specify exactly which packages and globals got passed to the other sessions; when I started using future I didn't have to, so I just kept using it, and it's probably the most popular system out there now. I'm not sure whether foreach has changed so that it can detect globals for you, but I don't think it has. Next: "Can a subset avoid always coming back in memory; can it return a disk.frame?" Yes, exactly: an operation on a disk.frame returns a disk.frame. If a is a disk.frame and I assign a1 to be a with its column multiplied by three, then the class of a1 is still disk.frame; it's basically a with that instruction stored, and a1 works exactly the same as a. You can collect on it, and you'll see the column multiplied by three. "If you just print a disk.frame, you only see its description, not the content; is there a way to print the content?" Yes: I've shown head, tail, and get_chunk; use those. "Does srckeep carry over from one chunk of code to the next?" srckeep only affects the expression it's attached to: it modifies that pipeline on a, and therefore every chunk processed inside it, but it doesn't change anything globally or affect other uses of a. Next is a question about chunking mismatches, which I've partly answered already: if two disk.frames have different chunking, do we get an error, or does it still do the job across mismatched chunks?
When you set merge_by_chunk_id = TRUE, disk.frame kind of assumes you know what you're doing. So if the chunk counts don't match, I don't think it throws an error; it may throw a warning, but I'd have to check, because I don't remember putting one in. The idea is that the moment you set that flag, disk.frame assumes you meant to do exactly this. Next, a question about the arbitrary-function section: "Does it have to be a formula? Could it be a custom function? I'm thinking about calling a neural net and doing something with it, a CNN?" Yes, as others answered in the chat: the whole attraction of this is that inside cmap you can do whatever you want; any R package you can load, you can use there. "Is it a problem that the CSV reading behaviour changes when the file reaches a certain size? Can you force a behaviour?" Yes, you can force it to do whatever you choose, although the documentation is admittedly a bit dense. For example, you can choose which backend is used: if you don't like data.table for some reason, set the backend to readr and it will use that. You can also override the recommended chunk count: you saw there's a recommend_nchunks function, and if it recommends, say, 168 chunks and you want 200, you can override that. And the dot-dot-dot gets passed to the reader: with the data.table backend, everything you put in the dot-dot-dot goes to fread; with the readr backend it goes to, I believe, read_csv. So there's a lot of control over what you can do, although combinations matter: if I choose the data.table backend but set the chunk reader to readLines, the logic may end up doing something quite different. I think we're over two hours, so last question: "Could there be an option to force the second file to take the colClasses of the first? I read a lot of data where I don't know the types." You mean automatically? There's no option along the lines of "follow the colClasses of the first file equals TRUE". You'd have to obtain the column classes of the first file manually and pass them in via colClasses yourself, although once you set colClasses, the same classes get applied to every file you read, so that does the job; there's just no convenience function for exactly what you describe. It could be an interesting addition, though. I don't see any other new questions. Someone suggests that joining with mismatched chunks could be useful in some cases; yes, it's a bit like recycling in R, useful for some things, but typically not something you want to do, though maybe some clever algorithm would want it. Anyway, thank you very much. I've gone just a little bit over time, but thank you everyone for attending. I believe this was recorded and will be shared online.
And as I mentioned, if you have any questions, feel free to go to GitHub, click on Issues, and ask; I try to be as responsive as I can, and thank you very much for attending. Awesome, thank you so much, ZJ, this was really, really cool. And for everyone else, I'll send an email through the Meetup, like the ones you received about this event coming up, with the link to where you can see the video and also a short survey from USAR 2020, so keep an eye out for those. Thanks, everybody. Thank you so much, ZJ, it was really awesome. Great, thank you very much, and thank you for having me. I guess that's it. Have a good night or morning, everybody, wherever you are in the world. I'll probably talk to you all later. Good night. Thanks, bye. Bye, thanks.