Welcome back to the next Code Club. I think this is our fourth Code Club. I know many of you who have been watching, or who are online today, have been here from the very beginning, and so the data set that we'll work with today should look fairly familiar. It's a data set of candy data where the folks at FiveThirtyEight looked at about 150 or so different types of candies and broke down different characteristics of those candies: things like whether it had chocolate or fruity flavor, and whether it was a bar or a package of small pieces like Skittles or M&Ms. They looked at the sugar content and the price, and then they asked people to tell them which candy they liked better. So they took pairs of candies and said, do you like this or this, this or this, this or this? And they came up with a win percentage to figure out which candies people like the most.

Last week we started working with the function called filter, and we used another function with it called count. That filter function allows us to take a data frame that might have many, many rows (thousands, tens of thousands, millions of rows) and then pull out the rows from that data frame that match a logical question that we're interested in. Last week we looked at things like: do you care about grammar, and if you cared about grammar, how often did you use the Oxford comma?

So this week, like I said, we're working with the candy data. If you celebrate Easter, perhaps you still have candy lying around your house that you're trying to hide from your children. Perhaps you're like me and you're just trying to eat their candy.
I'm not ashamed of that. So we're going to load the tidyverse package and get the candy data in, and what we're going to work on today is seeing more features of the filter command, as well as using a series of commands called group_by and summarize to get summary data for different subsets of our overall data. I'm going to go ahead and copy this, and then I'm going to switch over to RStudio, open up a new script, and paste those lines in. So that you all can see my screen more easily, I'm going to slide this over to the right and make my font a bit larger. Now I'm going to go ahead and highlight and run those lines.

Down here on the console, if I type candy data, I can see that my candy data is a data frame with 85 different candies and 12 or 13 different variables that they collected on each of those types of candy. You can see the various factors that they looked at, and then some of the types of candies that they collected data on.

You'll recall from last week we did things like this. Say we want the data on the 100 Grand bar. I don't know anybody that really likes the 100 Grand bar, but we could take candy data and filter with competitor name equals equals 100 Grand. We run that and we get the row back for the 100 Grand. Similarly, instead of the equals equals we could do exclamation point equals. We talked about how the exclamation point means not, so this gives us the things that are not a 100 Grand. And you see that we no longer have the 100 Grand, and that it starts with 3 Musketeers. So that's what we covered last week. We also said we could complement this with a count function.
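
As a rough sketch of that review, the snippet below runs the same three steps (an equality filter, a not-equal filter, and a count) on a four-row stand-in table. The object name candy_data and the column names competitorname and bar are assumptions here, since the transcript only gives them in spoken form, and the stand-in table is not the real 85-row data.

```r
library(dplyr)

# Four-row stand-in for the candy data (assumed object and column names).
candy_data <- tibble(
  competitorname = c("100 Grand", "3 Musketeers", "Skittles original", "M&M's"),
  bar            = c(TRUE, TRUE, FALSE, FALSE)
)

# == keeps only the rows whose name matches exactly.
one_candy <- candy_data %>%
  filter(competitorname == "100 Grand")

# != keeps every row except the match.
all_others <- candy_data %>%
  filter(competitorname != "100 Grand")

# count() tallies the remaining rows for each value of bar.
bar_counts <- all_others %>%
  count(bar)

print(bar_counts)
```

On the real data the same pipeline should report the counts of bars and non-bars among the remaining 84 candies.
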
So then, of all the things that aren't a 100 Grand, let's count the bars that we have, and we see that there are 20 candy bars left after we remove the 100 Grand, and 64 non-candy bars. Okay, so that's what we talked about last week.

What I want to do now is show you two other things about the filter function that will make our data analysis a lot easier. The first is to note that when we look at the candy data data frame, a lot of our columns have this lgl under them, which means logical. It's a logical variable: it's true or false. And the argument of filter gets evaluated to determine whether the value is true or false for each row of the data frame. If it's true, we get that row in our new data frame. So if I were to take candy data, pipe that, and then do filter chocolate, I will get a new data frame with all the rows where a candy contained chocolate. We can see that here in the second column, where all the values are true. Sometimes this confuses people, because they're like, I'm just putting in chocolate, what? But again, the values in that chocolate column are trues and falses. I could do chocolate equals equals true and I get the same result, but that equals equals true really isn't necessary. So again, that's a really useful feature of the logical variables in our data frames. That's the first thing I wanted to show you about the filter function.

The second thing is that sometimes we want to filter on a continuous or quantitative variable, like, say, sugar percent or price percent or win percent. Perhaps we want to look at low-sugar candies, or we want to look at expensive candies, or we want to look at the really popular candies, right?
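
The logical-column shortcut from a moment ago can be sketched like this. The rows are made up for illustration, and the names candy_data and chocolate are assumed spellings of what is said aloud in the session.

```r
library(dplyr)

# Made-up stand-in rows; only the column names mirror the real data.
candy_data <- tibble(
  competitorname = c("Candy A", "Candy B", "Candy C", "Candy D"),
  chocolate      = c(TRUE, FALSE, TRUE, FALSE)
)

# filter() keeps the rows where its argument is TRUE,
# so a logical column can be passed on its own.
chocolate_only <- candy_data %>%
  filter(chocolate)

# Spelling out the comparison selects the same rows.
chocolate_explicit <- candy_data %>%
  filter(chocolate == TRUE)

print(chocolate_only)
```
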
So the first thing that we'll do is look at the more expensive candies. To do this we can take candy data and pipe that to filter, where we say price percent greater than, say, 0.50. It says price percent, but it's a fraction that goes from zero to one, not zero to a hundred. So we run this, and we now get the 42 candies that were above 0.50 on the price percent index. I think what they did is they took the prices of all their candies and scaled them between zero and one. So we've filtered to get the more expensive candies.

Again, we could also get the low-sugar candies by doing candy data, filter, sugar percent less than, let's say, 0.33, so those in the lower third. There we find that we have 34 candies that have low sugar, at least on this scale.

So maybe we want the expensive low-sugar candies. What would we do? Well, you hopefully remember from last week that we talked about the ampersand meaning and, and the vertical line meaning or. So we can combine these two conditions in one filter to say price percent greater than 0.50 and sugar percent less than 0.33. When we combine both of these logical statements with an and, for the value for that row to be true, both of them have to be true. If one of them is false, then the result is false. When we run that, we find there are 12 rows in our data frame: 12 candies that were more expensive but had lower sugar. So things like Milk Duds, which I'm a big fan of, fit that bill, right? They were expensive but low sugar. Well, we could also do the opposite, right?
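
Here is a small sketch of those numeric filters and of the ampersand and vertical-line combinations. The five rows are invented, so the counts will not match the 42, 34, and 12 quoted in the session, and the names candy_data, pricepercent, and sugarpercent are assumed spellings.

```r
library(dplyr)

# Invented values on the same 0-to-1 scale described above.
candy_data <- tibble(
  competitorname = c("Candy A", "Candy B", "Candy C", "Candy D", "Candy E"),
  sugarpercent   = c(0.10, 0.25, 0.60, 0.90, 0.30),
  pricepercent   = c(0.80, 0.40, 0.70, 0.20, 0.55)
)

# One numeric condition at a time.
expensive <- candy_data %>% filter(pricepercent > 0.50)  # Candy A, C, E
low_sugar <- candy_data %>% filter(sugarpercent < 0.33)  # Candy A, B, E

# & keeps a row only when both conditions are TRUE.
expensive_and_low_sugar <- candy_data %>%
  filter(pricepercent > 0.50 & sugarpercent < 0.33)      # Candy A, E

# | keeps a row when at least one condition is TRUE.
expensive_or_low_sugar <- candy_data %>%
  filter(pricepercent > 0.50 | sugarpercent < 0.33)      # Candy A, B, C, E
```

A quick sanity check on any such pair of filters: the "or" count equals the two single-condition counts added together minus the "and" count (here 3 + 3 - 2 = 4).
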
So we could say candy data, filter, price percent greater than 0.50 or sugar percent less than 0.33, and this will give us the expensive candies or the low-sugar candies. We find that there are 64 candies that were more expensive than average or had less sugar than most of the other candies. That or means that at least one of the statements has to be true.

Within our comparisons we could also say greater than or equal to, or less than or equal to. For this example it won't change the results, but we can use greater than, less than, greater than or equal to, less than or equal to, equals, and not equals as our logical operators to make comparisons between two values. That's what I wanted to talk about with filter for some new features, and again, I think those are pretty powerful.

At the first Code Club that we did, one of the questions I posed was: if we took chocolate candies, were chocolate bars more expensive than chocolate bite-sized candies? So is a Hershey's bar more expensive than M&Ms or those types of candies? Hopefully you can hear in my posing of that question that there's a filter statement, right? We want to get all of the candy that's chocolate. I can take candy data and I can filter to get chocolate. All right, so I now get all my chocolate candy, but within that I want to group my data by whether it's a candy bar or not. So I can use the function that we'll learn today called group_by, with bar, and then a new function, summarize, and I'm going to say mean price is the mean function of price percent.
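
The filter, group_by, summarize pipeline just described might look like the sketch below. The six rows and their prices are invented, and candy_data, chocolate, bar, and pricepercent are assumed spellings of the spoken names; mean_price is the summary column name used in the session.

```r
library(dplyr)

# Invented stand-in rows with the three columns the pipeline needs.
candy_data <- tibble(
  chocolate    = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  bar          = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  pricepercent = c(0.9, 0.7, 0.5, 0.3, 0.2, 0.6)
)

price_by_bar <- candy_data %>%
  filter(chocolate) %>%                          # keep only chocolate candies
  group_by(bar) %>%                              # one group for bars, one for non-bars
  summarize(mean_price = mean(pricepercent))     # one row per group

print(price_by_bar)
```

The result has one row for bar equal to FALSE and one for bar equal to TRUE, each with its own mean_price.
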
So again, what we're doing is we're taking our candy data data frame, we're filtering to just get the chocolate data, and then we're going to group the chocolate data by whether or not it's a candy bar. If it's a candy bar, we're going to get an average price; if it's not a candy bar, we're going to get an average price. Running that, we see that the chocolate bars are more expensive than the individual chocolate candies.

In a future Code Club maybe we'll talk about doing the statistical analysis here to know whether or not this difference is statistically significant. But to help give us a sense of whether it is, we could also add the standard deviation of the price, and that's going to be equal to sd of price percent. And then we can get an n, so the number of candies in that bin. Now we see that we have the standard deviation and the n, and so we have a pretty good representation of the candy bars and the non-candy bars. The standard deviations aren't identical, but they're reasonably close, and so it does seem that chocolate bars are more expensive than bite-sized chocolates. All right, so that's a way that we can get multiple summary values out of our grouped data.

Well, what if I want to group by whether or not it's chocolate as well as whether or not it's a bar? To do that, we need to add to our group_by function and remove our filter function. So we can take candy data and do group_by, and I'm going to leave the arguments there blank for a moment, because we know the next step, which is summarize: mean price equals mean of price percent, sd price equals sd of price percent, n equals n(). For me, when I do these summary functions, I generally like to give the mean, the standard deviation, and the n, or perhaps, if it's not normal data, the median, IQR, and n. These are kind of the summary statistics
I like to see. So what are we going to put in the group_by? Here we'll go ahead and put in bar, and I'm also going to do chocolate. We can group by multiple factors if we separate those column names by commas. I generally only group by two things at a time. I think it would be pretty rare to do three or more different groupings; it kind of gets overwhelming. And I've misspelled chocolate, I noticed, and I need a space. So if I run this, I find that I have a decent number of candies that are chocolate or not and bars or not, except that there do not appear to be many bar candies that don't contain chocolate. Is that like an Almond Joy? I don't know. But you know what, you now have the power to write your own filter function to go back and figure out what is a non-chocolate candy bar, so I'll leave that for you all for homework. What we see here is that if it's a bar, even if it's a non-chocolate bar (we only have one example of that, like I said), its price is comparable to what we found for the chocolate bars. And the non-chocolate non-bars, the bite-sized non-chocolate candies like Skittles, say, tend to be cheaper than anything with chocolate or anything that is a bar. Okay, so that's kind of cool.

Something I want to point out to you is that if I were to do candy data and then group_by chocolate, or group_by bar, sometimes it's too helpful. Just running those two commands, we see an added bit of text in the output of our data frame, and this is telling us that it's grouping it by bar. Now, when we go ahead and do our summarize with, say, mean price, that grouping goes away. I don't know if you noticed it, but when we ran the statement up here on my lines 19 to 21, where we group by both chocolate and bar, the output still has a grouping. So if you group by two or more different factors, the output of your summary will still have groups. Now, I've always found that this feature is not very helpful.
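
A sketch of the two-column grouping and the leftover grouping it produces is below, together with ungroup() as one way to drop it. The rows are invented, and candy_data, chocolate, bar, and pricepercent are assumed spellings; the summary column names follow what is said in the session.

```r
library(dplyr)

# Invented rows: note there is only one non-chocolate bar.
candy_data <- tibble(
  chocolate    = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  bar          = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  pricepercent = c(0.9, 0.7, 0.5, 0.3, 0.2, 0.4, 0.6)
)

price_summary <- candy_data %>%
  group_by(bar, chocolate) %>%
  summarize(
    mean_price = mean(pricepercent),
    sd_price   = sd(pricepercent),   # NA for a group with a single candy
    n          = n()
  )

# With two grouping columns, summarize() drops only the last one,
# so the result is still grouped by bar.
group_vars(price_summary)

# ungroup() removes whatever grouping is left.
price_summary_flat <- price_summary %>%
  ungroup()
group_vars(price_summary_flat)
```

For data that is not roughly normal, mean() and sd() could be swapped for median() and IQR() inside the same summarize() call.
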
It's rarely ever been helpful. I'm not really sure if it's ever been helpful. So what I like to add after doing this is ungroup. And so now if I run ungroup... it's unhappy with me. I wonder why. I've got a typo here. I've got it: this is a j. All right, so once I clean up all my typos and then do ungroup, I see that that grouping goes away.

So again, this is a way of showing that we can group by multiple factors. I mean, grouping by one factor is totally legit, but then we can get multiple bits of summary data out. There are many different functions that you might find useful for putting in this summarize function's argument list: things like mean, standard deviation, or n. You might also want to put in the median, IQR for the interquartile range, maybe the min, the max, things like that. It could be any function as long as it only returns one value for the set of data that's been given to it.

So with that, I have a series of questions for you all to engage with. Let me bring those up here. The four questions I'm going to have you work on are these. First, determine how many of the candies that won more than 75% of their matchups had chocolate in them. Second, do fruity candies have a different average price than non-fruity candies? Third, how do the prices of the more favored candies compare to those that are less favored? And then finally, I invite you to come up with your own question to answer with the functions we've discussed today. With that, we're going to go ahead and break and have you work in groups, and we'll come back in about 15 or 20 minutes for you all to share what you've done and to compare our results.

All right, so I think everybody is back now. Would anybody like to share how they did the first problem, of determining how many of the candies that won more than 75% of their matchups had chocolate?

Sure, go ahead and share your screen if you wouldn't mind, and we'll see what you did. Thank you.
This is the easiest one. So I went with the share... Okay, so what we did first is take our data set.

So I think you're sharing the web page rather than RStudio.

No, this is RStudio. Yeah, but I couldn't do that. Okay, sorry. Here we go. Can you see now?

Still coming.

So can you see my R now? Okay. So the first thing that we need to do is filter for win percent higher than 75. How many, all of them are... why didn't it... I forget this one.

I think up on lines 32 and 33 you have a partial pipeline started. So I think if you highlight those three lines and then run it, it'll be happier.

Oh, yeah, here we go.

There you go. Yeah, wonderful. You're welcome. Would another group like to share how they determined whether or not fruity candies have a different average price than non-fruity candies?

We could go.

Okay, go ahead.

So we grouped it by fruity, and then from there we calculated the mean price. So then if we run that, it would show true or false. False would be the non-fruity candies and true is the fruity candies. So we saw that there was a difference in the mean price of them.

Great, wonderful. Thank you. And then the third question was how do the prices of the more favored candies compare to those that are less favored? Anyone like to show how they went about figuring that out? I can give it a shot. Let me open up RStudio here. So, how do the prices of the more favored candies compare to those that are less favored? Here I think we'll do a double group_by, so we could take candy data and pipe that to group_by, and we're going to do more favored, so win percent greater than 50 and... Oh, it's just a single.
Okay, so we'll just do win percent greater than 50, and then pipe that to summarize, and then we'll do mean price equals mean of price percent. Then we see the candies that were more favored were actually more expensive than those that were less favored. Who knows why that is. I guess the other question is, they're not giving us an absolute price; it's more of a relative price index, so the average price is going to be 50. That doesn't mean that something at 25 percent is a whole lot cheaper than something at 75 percent, right? It might only be a couple of cents different. So maybe if we had candy price per pound or something like that, that would allow us to see how meaningful these differences in price are. Anyway, definitely fodder for future important research.

Did anybody come up with their own question to answer with the functions we've discussed today, to dig into the data a little bit deeper, that they'd like to share?

We tried to pursue what you hinted at at the beginning: the best candy, I guess, or the most frequent winner with low sugar. I can share my screen too.

Awesome.

I mean, kind of. So the first thing we were looking at, we were trying to see if there are candies with high win percent, so over 90, and low sugar content. But then if we run this, you can see we have a lot. So we tried to filter it more, to find only the candies with very low sugar, so even 2.1. But then we noticed, wait, something is wrong. Wait, no, it is great. Oh, no, I'm sorry. Yeah, this is great. So when we did this with candies with win percent over 90, you see actually that we don't have any, which maybe makes sense, because we need a lot of sugar, I guess. So we tried to be not as greedy.
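
The two answers in this stretch of the session can be sketched together: grouping on a condition rather than a column, and a combined filter that comes back empty. Everything below is an assumption-laden illustration: the rows are invented, candy_data, winpercent, sugarpercent, and pricepercent are assumed spellings, the group label more_favored is a name added here for readability, and the 0.33 sugar cutoff is borrowed from earlier in the session because the group's own cutoff is not clear from the recording.

```r
library(dplyr)

# Invented rows; winpercent runs 0 to 100, the other two run 0 to 1.
candy_data <- tibble(
  competitorname = c("Candy A", "Candy B", "Candy C", "Candy D", "Candy E"),
  sugarpercent   = c(0.05, 0.60, 0.20, 0.10, 0.80),
  pricepercent   = c(0.9, 0.7, 0.5, 0.4, 0.2),
  winpercent     = c(84, 60, 55, 40, 30)
)

# group_by() accepts an expression; naming it gives the new
# TRUE/FALSE grouping column a readable header.
price_by_favor <- candy_data %>%
  group_by(more_favored = winpercent > 50) %>%
  summarize(mean_price = mean(pricepercent))

# A filter that matches nothing returns a zero-row data frame, not an error.
over_90 <- candy_data %>%
  filter(winpercent > 90 & sugarpercent < 0.33)

# Loosening the win threshold lets a row through.
over_80 <- candy_data %>%
  filter(winpercent > 80 & sugarpercent < 0.33)

print(price_by_favor)
print(over_80)
```
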
So we had like 80 maybe, so maybe we'll have a winner here. And then we did, and there's actually one. And then we tried to find which one that was, and that was the Reese's Miniatures. So that was interesting.

So the Reese's Miniatures are very popular but really low sugar. Yeah, cool. Very good. Does anyone else have an example they'd like to share? That's fine. Great.

Well, thanks for participating today, and hopefully you felt good with more practice using filter, and with group_by and summarize. Those two steps I find to be really powerful. I was just talking to a friend by email who has all this data. They've got a bunch of technical replicates and biological replicates, and in my mind I was thinking, well, you'll probably want to group the technical replicates together, perhaps by biological replicate, get a mean, and then compare across your biological replicates, perhaps with another group_by type of thing, right? So it's a very common series of steps for summarizing data, and like I said, it's very powerful, and you can get really far with a few basic functions like mean, standard deviation, and median, things like that.

Any comments or questions people have before we sign off? Well, as I was popping into different groups, it seemed like you all did a really good job of grasping the material and then using it with these questions.

Unfortunately, we're going to have to take a one-week hiatus. Next Thursday I'm going to be teaching for the University of Michigan during this time, so we'll be back in two weeks with the next Code Club. Look for something about a week and a half from now on the website announcing the next Code Club. Thanks again for coming. I know many of you, if not all of you, have been to multiple Code Clubs so far, so thanks for your support.
By all means, feel free to shoot me an email if you have any ideas or suggestions for things that you'd like to hear about. I'm trying to gradually build up pipelines, and I'm very happy to do whatever you all are interested in doing. So with that, thanks. Have a great week.

Thank you so much.

Thank you. Hope you're not getting too much candy while we're hanging out at home.