I'm not sure if this is a news flash, but not everyone has heard of tidy data. I know. Shocker. Well, 10 years ago, I didn't know about tidy data either, and so I created a file format that could potentially have thousands and even hundreds of thousands of columns. In today's episode, I'm going to review how we can read a very wide data frame with thousands of columns into R and then convert it to a tidy data frame with only three columns without losing any information. We'll see how to do this with functions from the readr and dplyr packages from the tidyverse, as well as functions from the data.table package, which run far faster than those from the tidyverse. Before we get going too far in today's episode: I'm Pat Schloss, and this is Code Club. In each episode of Code Club, I present concepts that I use in my own research to improve the reproducibility of my analyses. Please be sure to subscribe to the channel and hit that bell icon so you know when the next episode is released.

So what do I mean by tidy data? Well, according to Hadley Wickham, who conceptualized tidy data, a tidy data frame follows three rules. First, each variable must have its own column. Second, each observation must have its own row. And third, each value must have its own cell. We can violate the first rule, for example, by measuring temperatures on multiple days and putting each day in a separate column of the data frame. In reality, we only have two variables, temperature and day, and they should each have their own column. We can violate the second rule by putting different variables on different rows, or different lines, of the data frame. For example, we could have weather data for a bunch of cities, and for each city we could have a row for today's temperature and a row for the amount of precipitation that fell today. A tidy approach, however, would be to have columns for the name of the city and the date, and then separate columns for the temperature and the amount of precipitation. We can violate the third rule by having a column where the cell contains both the temperature and the amount of precipitation; that cell might say 70 degrees Fahrenheit and half an inch of rain. Instead, we'd like separate columns for the two variables, where one column has the temperature of 70 and the other has the precipitation of, say, half an inch. Aside from tidy data being easier to work with, R processes tidy data far more efficiently than untidy data. Finally, we should appreciate that the tidiness of data really does depend on the application and the goal of the analysis.

In the last episode, we saw that we could use mothur to generate a shared file that indicates the number of times each amplicon sequence variant, or ASV, occurs in each genome. The rows contain data for the genomes, the columns represent each ASV, and the cells indicate the counts. That resulted in a data frame with close to 16,000 rows and several thousand columns. Thinking about the rules of tidy data, this data is not tidy, because I have many columns for the same variable, the ASV. Rather than having a column for the genome and a column for each ASV, a tidy data frame would have a column for the genome, a column for the ASV name, and a column for the number of times that ASV appeared in that genome. Now stay with me: even if you don't know what a 16S rRNA gene sequence is, or what an ASV is, I know you'll get something out of today's episode.
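To make those three rules concrete, here's a minimal sketch of the weather example. The tibble, its column names, and its values are all made up for illustration; the point is how pivot_longer() gathers the per-day columns into a single day variable:

```r
library(tidyverse)

# Untidy: one variable (temperature) is spread across a column per day
untidy <- tibble(city  = c("Ann Arbor", "Detroit"),
                 day_1 = c(70, 72),
                 day_2 = c(68, 71))

# Tidy: one column per variable - city, day, and temperature
tidy <- untidy %>%
  pivot_longer(-city, names_to = "day", values_to = "temperature")
```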
Besides learning how to convert an untidy data frame into a tidy data frame, we'll also see how we can efficiently read in a wide data frame, because those live in the wild. As I mentioned earlier, R really struggles with wide data. I'll review the fread function from the data.table package, which we saw in a previous episode; it really excels at reading in data frames with a bunch of columns. You'll recall, hopefully, that in the last episode we worked on issue 34, create ASVs. Prior to that, we'd been working on exact sequence variants, ESVs, which were groups of sequences that were identical to each other. We called them ESVs because they're exact, they're identical to each other, and they came from genome sequences where we have a lot of confidence that the sequences are correct. In actual practice, when sequence variants get clustered together by a program like DADA2, Deblur, UNOISE, or mothur's pre.cluster, there's a little bit of slop, because sequencing error and PCR artifacts creep into the data, so they're not exact; they're amplicon sequence variants. There's some amount of denoising that goes on there. So we used mothur to generate those ASVs with its cluster command, using very fine-level distances, and then the make.shared command to create a shared file. And that's what we're going to talk about today: how do we convert that shared table into a tidy data frame, also called a tibble within the tidyverse? We've largely done part three of the issue here, but we still need to do that conversion, and we'll do it in R.

If I open up my project, I can show you where we were before. In the last episode we talked about get ASVs, and we completed running cluster and make.shared to create a shared file, finishing with this data/v4/rrnDB.unique.opti_mcc.shared file. We'd like to turn that into data/v4/rrnDB.0.01.count_tibble. In the Makefile, we already have a rule for that, which was down here somewhere... right here, right? So we've already got the rule. We can run it, but it doesn't actually build the target, because we need to bring in code to do that, and the code we'll use will be code/convert_shared_to_tibble.R. I'll copy that name, touch the file, and open up the project, and we'll be working on this code as we go along in today's episode. So if I come to Files (I'm in my working directory; you can see my project here) and go to code, I want convert_shared_to_tibble.R, and here it is. Because I always forget the shebang line, I'm going to copy the shebang line from one of these other files and paste it in here. In the header comments, the input is a shared file generated in mothur, and the output will be a tibble, or tidy, version of the shared file, and we'll do library(tidyverse). If I look at the get ESVs script, this will be good: I'll copy these three lines that take arguments in from the command line and pull the input and output file names from those arguments. I'm going to comment them out for now, because as I'm testing, I want to tell the script exactly which files to use.
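The skeleton of the script ends up looking something like this. It's a sketch: the header comments and the argument handling mirror what gets copied over from the get ESVs script.

```r
#!/usr/bin/env Rscript

# code/convert_shared_to_tibble.R
#
# input:  a shared file generated by mothur
# output: a tibble (tidy) version of the shared file

library(tidyverse)

# Pull the input and output file names from the command line;
# comment these out and hard-code the names while testing
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]
```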
So I'll set input_file, and that's going to be data/v4/rrnDB, and I believe it was unique. If I come back to the terminal and list data/v4, I see that what I want is this unique.opti_mcc.shared file. So that's good. The output file is going to be the same thing effectively... well, not really the same thing at all, is it? I'll give it the 0.01.count_tibble name, and let me run these lines of the code before I forget; I always seem to forget.

All right, so we want to read in the shared file and output it as a tibble. Let's do read_tsv(input_file), and let's see, this is going to take a while, because it's got several thousand columns, and as I mentioned, R is really bad at reading in data that's really wide. You can see it's showing the progress bar for reading it in, that's how slow it is; usually you never see that progress bar. What we see is that there are 15,578 rows and 3,555 columns. We have a column for label (that's the OTU definition, or ASV definition in this case), the Group name (that's the genome name), and numOtus, the number of OTUs, or in this case ASVs, that follow. This label column we don't need, this numOtus column we don't need, and Group should really be called genome. So let's bring that into our dplyr pipeline. Again, reading in a shared file is something you might do in any kind of microbial ecology analysis where you're processing output from mothur. Like I said, when I was first developing mothur, I didn't know about tidy versus untidy data; I just thought it seemed like a good idea to have data that looks like a wide spreadsheet. Oh well, we learn, right?

So we can do select(-label, -numOtus), and again we'll run that, and it's going to be quite slow, but that'll be okay. Then we can do rename, where it's going to be the new name equals the old name. Let me double-check, because sometimes it's Group and sometimes it's group... yep, it's Group, not group, so I'm glad I checked. So we want rename(genome = Group), and we run all of this, and again it's slow because of that initial read_tsv step. Alrighty, that goes reasonably quick; you know, it feels slow because we're impatient, but it's not really the end of the world. The output we then see has a genome column and then all of our OTUs. But as I mentioned, we really have two variables here: the genome and the OTU name, plus the counts, and unfortunately we have all of our OTUs, or ASVs, as separate columns. We'd like to bring them together to make our data tidy. To do that, we'll use pivot_longer, and we'll say -genome, because we want to pivot everything except that genome column. names_to is the name of the column that the column names go to; so there will be a column holding all the OTU names, and I'm going to call that asv. Then values_to is going to be count. If we run this, again it's rather slow. The advantage of making the data tidy is that once we've got it tidy and written out as this table, even if it's like a million lines long, it's going to be really fast to read back in, because R can handle long; it just really can't handle wide. And so we see we now have some 55,330,000 rows.
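Pulled together, the tidyverse version of the pipeline we just built up looks like this (with input_file as defined above):

```r
read_tsv(input_file) %>%             # slow: the shared file is very wide
  select(-label, -numOtus) %>%       # drop the clustering label and ASV tally
  rename(genome = Group) %>%         # note mothur's capital G in Group
  pivot_longer(-genome,
               names_to = "asv",
               values_to = "count")  # gather the ASV columns into two columns
```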
And we've got three columns, the genome, the asv, and the count, and we see that many of the counts, at least in these first ten rows, are zero. We can remove those by doing filter(count != 0), which removes any row where count is zero. Again, it takes a while. While this is loading up, be sure you've liked and subscribed to the channel, and be sure you click that bell icon so you know when the next video is released. The pivot_longer also seems to take a few moments, and what we get out is a tidy data frame with 17,000 rows and three columns.

Now, this hopefully seems familiar to those of you who are loyal viewers of Code Club, because we did this before with get ESVs. In that episode we talked about different strategies for benchmarking how long things take, and we talked about the data.table package, which has functions like fread for really accelerating reads of large data frames, especially wide ones. So what I'd like to do is revisit that. The get ESVs script read in a count table, and in a count table the rows were the ESVs and the columns were the genomes, so it's kind of transposed relative to the shared file format. We're going to do it again and learn some new things along the way.

I'll do library(data.table) to be sure I load that, and I've copied the pipeline and commented out the first version so we can compare and contrast how things look. What I'll do is remove that pipe and replace read_tsv with fread, and this is going to read in the data. Bam! It's done, right? That looked like a second or two versus the 10 or 15 seconds using read_tsv. read_tsv and the tidyverse functions are nice because they're really easy to use and the syntax is really well designed, but the cost of that is performance; they tend to be a little bit slower. fread does a really nice job of reading in those wide data frames. Another argument that we didn't talk about before with fread is the drop argument, which lets us name columns that we want to drop. So instead of doing select(-label, -numOtus), we can say drop and then, using the c function to build a vector, give it "label" and "numOtus". If we run this and look at what it produces... a lot of output. We see that it starts with label, Group, numOtus; why did it do that? With all this output it gets really hard to see what's going on, so I'm going to hit that broom icon to clean things up. If I run that and scroll back up to the top, I now see that I've got Group and then all my OTU names. So we're in good shape, and I don't need the select function anymore.

The next thing is to rename that Group column to genome. We can do that with setnames, giving it the old name and then the new name. Again, we don't need to give it x, because that comes through the pipeline, and I see I forgot my pipe at the end of line 24, so I'll put that in. The old name was Group, the new name is genome, and I'll put those in quotes, make sure my pipe is in here, and clean up the output so it's easier to see what's going on. That all looks good, except for some reason it's not printing the result. Sometimes we can do .Last.value. There.
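For reference, here's where the data.table version of the pipeline stands at this point, a sketch of just the read and rename steps:

```r
library(data.table)

# fread reads the wide shared file in a second or two; the drop argument
# discards columns by name at read time, replacing the separate select()
fread(input_file, drop = c("label", "numOtus")) %>%
  setnames("Group", "genome")  # renames by reference; returns invisibly,
                               # which is why nothing prints without .Last.value
```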
So .Last.value lets you print the last value that was generated, and scrolling back up here, we see that we have now changed Group to genome. I'm not sure why it didn't print on its own; maybe that's just a quirk of setnames. So again, we don't need this rename function. We could still use rename, right? We'd get the same result, and again, let me clean this up. We get the same output using rename that we had using setnames, so they both work well here. I'm showing you setnames and some of these other functions that come from data.table as alternatives to the dplyr and other tidyverse functions we've been using.

All right, so we've set the names; you can see that our three lines have now come down to two, and we're ready to do the pivot_longer. Instead of pivot_longer, we're going to use melt. I forget the syntax, because honestly I don't use this package a whole lot; I mainly only need it when I read in these wide data frames. So I'll pull up the help for melt, and we can see id.vars, variable.name, and value.name. We need id.vars, and that's going to be genome. Then we want variable.name, and that's going to be asv. And value.name is going to be count. Okay, I think that should work, and bam, we get three columns: the genome, the asv, and the count. So we're in good shape, and that got rid of the pivot_longer step. Again, we could run it with pivot_longer; it ends up taking a little bit longer, as we see here. That pivot_longer from the tidyverse is slower than melt. Does it matter? I don't know; it's up to you. I work a lot in the tidyverse, so I like using pivot_longer, and that extra couple of seconds doesn't really cost me anything.

Okay, so the final step is the filter. Again, we could run filter(count != 0), but that's the tidyverse way. The data.table way has a bit of a different syntax: we'll do a period and then open and close square brackets, and inside do count != 0. What the period means in a pipeline is the data that's coming through the pipeline; we've seen that before with some of the inner joins. So we're using the square-bracket notation that you've perhaps seen before in a lot of base R programming, and this does the same thing as filter, but it keeps everything within data.table. We run this, and again it's relatively quick, and we see that we no longer have those zero-count values. So that's great.

Okay, so the final thing we want to do is write_tsv, and we're going to output that to output_file. And that's our pipeline, right? For comparison, I can put the first version up here: these two pipelines do the same thing and give you the same data. The first one purely uses the tidyverse; the second uses data.table. Both take a very wide data frame and make it tidy, which is great. So I'll save that, get rid of the two test input and output file names, and uncomment the command-line argument lines. This all looks good, and I need to make the script executable, and then we'll need to add it to our Makefile rule. All right, maybe what I'll do before that is make sure it works. So I'll do chmod +x code/convert_shared_to_tibble.R. Good. And I can run code/convert_shared_to_tibble.R with data/v4/rrnDB.unique.opti_mcc.shared.
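And here is the complete data.table version, for comparison with the tidyverse pipeline sketched earlier; it's assembled from the steps we just walked through:

```r
fread(input_file, drop = c("label", "numOtus")) %>%  # read wide, drop two columns
  setnames("Group", "genome") %>%                    # data.table's rename
  melt(id.vars = "genome",
       variable.name = "asv",
       value.name = "count") %>%                     # data.table's pivot_longer
  .[count != 0] %>%                                  # keep only non-zero counts
  write_tsv(output_file)                             # long files read back fast
```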
And if it works, the output will be rrnDB.0.01... ah, I didn't give it the output file name. It didn't complain. Let me see what it did. Yeah, it didn't output anything, so I need to give it the output file name. So we'll add data/v4/rrnDB.0.01.count_tibble and run this. Within the script, we could put in error checking to make sure that the output file name isn't blank. And sure enough, we've now got rrnDB.0.01.count_tibble, so we're in good shape.

So again, we'll come back to Atom and look at our get ASVs rule. I'll call code/convert_shared_to_tibble.R, and then my input file is going to be this, and my output will be this. And that would be good, except I don't want to hard-code those names, because if I look at different regions, I'm going to get different values, and I'm going to have different thresholds. It occurs to me that the output is my target, so I can replace the second file name with $@. That looks good. And then we need to figure out where in here we have... yeah, we'll use a stub to get everything up through data/v4/rrnDB. That looks good. We can then add this to our garbage collection to remove the intermediate files, and I think we're in good shape. (I'll put a sketch of the finished rule at the end of this transcript.) So let's give this a shot. Maybe we'll be ambitious and run it through make: make data/v4/rrnDB.0.01.count_tibble. Let that rip. This will take a moment or two, and we'll see how it goes. It completed; it took a few minutes to make that shared file. If we do ls -lth data/v4, we see we've got our count_tibble, and that's in good shape. I'll look at the top of the file to make sure everything looks good... rrnDB.0.01... and that looks great. We've got our three columns, just like we expected. So again, we're in good shape.

Something you might think about doing for homework: we learned a couple of new things that work with fread from the data.table package that we did not incorporate when we wrote the original version of get ESVs (I see I forgot to change the name here, but I'll leave it for now). That script still has the select, the rename, and the filter. So see if you can adapt that code using what we learned in today's episode to make better use of the data.table package. I don't think it's going to change the speed, and it certainly will not change the output, but it helps you have more tools in your tool belt for those circumstances when, you know, perhaps your data is just too wide or too big to process with something like read_tsv or read_csv.

Okay, great. Where we're at now: you'll recall that we have this rule to create the count tibble, but we only have one distance threshold. In the next episode, we'll see how we can modify this to have two unknowns, so to speak, in the target name of our Makefile, and we'll generate a whole bunch of target names that we will then build. That will require us to do a little bit of programming in make, which is always fun. I know not a lot of people use make, but maybe tune in and see what we do, and maybe you'll say, hey, that's pretty cool, I could incorporate that into some of my own work. So be sure you do the homework. As always, keep practicing, please tell your friends about Code Club, and we'll see you next time for another episode of Code Club.
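For reference, here's a hedged sketch of roughly what that Makefile rule looks like. The file names follow the ones used above, and the real rule derives the shared-file name from a stub so it can generalize to other regions and thresholds in the next episode, so treat the exact spellings as approximate:

```make
# target: the tidy count tibble; prerequisites: the script and the shared file
data/v4/rrnDB.0.01.count_tibble : code/convert_shared_to_tibble.R data/v4/rrnDB.unique.opti_mcc.shared
	code/convert_shared_to_tibble.R data/v4/rrnDB.unique.opti_mcc.shared $@
```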