Welcome to part one of a two-part workshop. You don't necessarily have to come to part two, but if you registered, the two parts will build on each other. My name is John Little, and I am the data science librarian. I work in the Center for Data and Visualization Sciences, which is a center embedded in the Duke University Libraries. One of the things about the R community is it tends to be very helpful. Normally, if we were meeting face to face, I would tell you to introduce yourself to your neighbor, because you might want to ask them for a little clarification, but we can't really do that so well in Zoom. So part of this is up to you, but I shared all the resources with you in advance because it makes it easier, I think, to get a big chunk of R presented to you; even if it doesn't all make sense initially, it will eventually come together. You can rewatch the videos. There is a full recording of this workshop at the end of the YouTube playlist, which is on the Rfun site; I'll cover that. So let's get started. I always like to start by reading this land acknowledgement, so if you will give me your kindhearted attention, I will read it now. I'd like to take a moment to honor the land in Durham, North Carolina. Duke University sits on the ancestral land of the Shakori, the Eno and the Catawba people. This institution of higher learning is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of our collective journeys.
Okay, so I thank you for your attention on that. This is obviously not a class on social justice, and we may not talk about any social justice issues here; it's mostly gonna be about the nuts and bolts of using R in a reproducible manner. But I like to read that and acknowledge it. And I simply ask that if you are faced with areas of injustice, maybe something you learn here today will help you rectify those issues that we all face. Okay, another thing to mention: as I said, you should have gotten an email from me in advance. This is a flipped workshop, so I send a whole bunch of material in advance that you can review. And if you review it and attempt the questions, it will help you clarify what questions you have. So I'm gonna reserve some time after I do the initial introduction for those people who've done that prep work; I'm gonna privilege them and ask them, do you have questions that are specifically on the material? I'll also ask that questions be centered on the material shared. While the best way to learn R is to apply what you're learning to your specific research project, please recognize that the rest of the people in this webinar don't actually have your data, and they may not have your background in your discipline. So it doesn't really help the group to focus on a specific issue that's not broadly applicable. However, that's what we have consultations for, and I am more than happy to meet with you and help you customize a learning plan specifically for your work. I'll also make the recommendation that after you leave today's webinar, one of the best ways to further and reinforce your learning is to take a very simple project, not a complex project, but a simple project where you know the data and you understand the research process, and you may have done it in a different tool, maybe Tableau or Excel or Python or Stata. Take that project and just try to replicate it in R.
That will help bring into high relief the things that you don't quite yet understand but are pretty certain can be done, and it'll help you figure out what kind of questions you wanna ask me, possibly in a consultation. Now, I also like to mention that, for the most part, this is what I call an "eat your own dog food" presentation. What I mean by that is I'm gonna try to model using R in a research process, with a reproducible research workflow, to demonstrate what R can do. So basically everything that I'm presenting here today, including the slides, has been generated in R. An exception being, for example, that the workshop is hosted in Zoom, the preparation email got sent out over Outlook, and the survey was done in Google Forms. But R can orchestrate the data that Google Forms collects by reading in the Google Forms spreadsheet, and then you can manipulate the data that way, and the whole thing is a reproducible process where I can essentially press play, and if I get new data, the slides update themselves relevant to the data I have at the time. So I'm gonna try, again, to model what R can do for you. Now, this first slide here is simply telling you where you all are coming from, right? We're looking at a stacked bar chart on a slide deck that was generated in R, and you can see that we've got folks from public policy, the school of medicine, biology, the health system, engineering, computer science, and it looks like psychology. I didn't make that abbreviation; I think it's psychology and neurology, neuroscience. Another thing that R does really well: once you've generated a graph, you can alter that graph with one line of code. So I can turn each one of these academic statuses into a subgraph with one line of code in ggplot2, which we'll learn more about tomorrow, where that one line of code is called facet_wrap, and I can take this bar chart and create a facet for each one of these academic statuses.
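Since the survey data itself isn't shared here, a minimal sketch of that one-line change, using ggplot2's built-in mpg dataset as a stand-in for the survey data, looks like this:

```r
library(ggplot2)

# A stacked bar chart: counts of car class, filled by drive type
p <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# One added line of code splits the chart into a facet (subgraph) per group
p_faceted <- p + facet_wrap(~ drv)
```

In the workshop slides the faceting variable would be academic status rather than drv; the mechanics are the same.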
So if I go to the next slide, you can see that there. It's not necessarily the best visualization in the world, but it's a nice way to get a different view, and it demonstrates one of the things that R can do for you: minimal manipulation of your data and maximal flexibility with respect to the reports you're generating. This is data collected from the survey that I sent out in advance, where I asked what kinds of skill sets you are bringing to this workshop. And this is a pretty typical curve for how people respond to the survey. By the way, it's just a simple time series graph, and ggplot2 does a great job of generating time series graphs; they're not particularly difficult to do, just a few little tricks and tips you might want to know about. But here is a proportional bar graph where I'm asking what kind of experience you are bringing to this workshop. It's a convenience survey, not a pre-tested, rigorously designed social science survey, but it gives me an idea of what people are bringing. One of the things you can see here, and this is typical too, is that we have some people who are coding quite a bit, weekly and daily, and then we have a number of people who are not coding so much. So if we can all be generous to each other, we may have to cover things that some of us already know, but hopefully we'll get to things that you are interested in. I also ask questions about version control and command line interfaces and databases, because all of those things have a stake in how you do reproducible work. I have workshops on nearly all of these things that you can refer to, and I'll tell you how you can get to those. They can also all be orchestrated in R; for example, you can use R to connect to relational databases, subset data in the database, and then pull back smaller subsets, all that kind of stuff. We won't cover all of that today, but I want you to be aware that it's possible.
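As a sketch of that kind of time series graph, here is a minimal ggplot2 version using its built-in economics dataset (the survey-response data itself isn't shared, so the variables here are stand-ins):

```r
library(ggplot2)

# A simple time series: dates on the x axis, a count on the y axis
ts_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = NULL, y = "Unemployed (thousands)")
```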
This is the bulk of what we're gonna talk about today: importing data, editing scripts and subsetting data. A large portion of y'all who answered the survey feel very comfortable with that, but many people do not, so we'll try to cover it well. I'm also gonna talk about projects and reproducibility. Those are difficult concepts to convey clearly in the time allotted, so I'm not gonna go into them in great depth, but I would point you to these other useful links. This Rfun link is a collection of all kinds of modules for learning about R, and then of course there's our center's homepage. If I click on this one, I'll just give you a quick tour. There are a whole bunch of modules here, like if you wanna learn more about mapping, or how to make slides, or how to use Git with R. Almost all of those have video recordings, shareable data, slides, that kind of thing. Feel free to scroll through those. You may have already seen the page that supports today's workshop and tomorrow's workshop. It has some embedded videos and a whole bunch of links to additional pieces of information about using R. So if something I covered went by too quickly, or you're coming back to it two or three months from now and you want a refresher on joins and merges or on using assignments and pipes, you can find these shorter videos; they're almost all under 10 minutes, roughly. There's also a link to a YouTube playlist which has all of those videos, plus full workshop recordings from an earlier offering of part one and part two, roughly equivalent to today and tomorrow. And let's see. Okay, a little bit about good ways to get help in R. The R community is actually known for being very helpful. There are lots of good places to get help. People post questions to a site called Stack Exchange; I don't know if you've ever heard of Stack Exchange, but you can post questions there.
There's also an R-specific, Stack Exchange-like place; if you Google the phrase RStudio Community, you'll get there. Very helpful people there. Also, there is a Slack community called R for Data Science. I'm not sure what you would, I mean, maybe you could Google that and get to it; and if you wanna find it and you can't, just reach out to me and I'll send you the information. It's a very helpful community, always trying to answer questions, but they've also developed a kind of best practice that they promote called a reprex, or reproducible example. The whole idea is to post your question in a way that entices an efficient, relatively quick answer. And you do that by limiting your question to the important part, right? So there's a website that you can read, and there's even a package that can help you make a reprex, but the idea is that you're sharing the simplest, smallest data possible. In fact, if you can use a built-in dataset, which we're gonna do a lot of today, you don't even have to share your data, because it's easier to give an example when everybody's looking at the same data. And then you share the code only on a need-to-run basis, right? What generates the error or problem that you're seeing? Now, I understand that since we're all in an intro course here, you may not know how to do all of those steps, and I'm totally sympathetic to that. I'm very sympathetic to people who are new to R, because it's a lot to take in. It feels like a big learning curve, and if you're getting overwhelmed at some point, I would just suggest you hang in there; it'll get better, it starts to fall into place. And if you're feeling overwhelmed even by this idea of a reprex, don't let it get you down; just shoot me a note and we'll figure something out. But the idea is that I'm not always available. You might be doing your homework at three in the morning, and I'm hopefully asleep.
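To make the reprex idea concrete, here is a hypothetical question posed the reprex way: built-in data (mtcars) and only the few lines needed to reproduce the problem.

```r
library(dplyr)

# "Why is my mean NA?" -- the asker's minimal failing code:
bad <- mtcars %>%
  mutate(wt = ifelse(wt > 5, NA, wt)) %>%  # a few heavy cars become NA
  summarize(mean_wt = mean(wt))            # mean() propagates the NA

# Because everyone has mtcars, a helper can spot the fix at a glance:
good <- mtcars %>%
  mutate(wt = ifelse(wt > 5, NA, wt)) %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE))
```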
But if you can reduce your question to something that's simple to respond to, you're more likely to get a response. Put another way, of the people who like to be helpful in R, nobody wants to get a thousand lines of code from somebody who says, there's something wrong with the third subroutine, I'm just not sure what's wrong, right? You need to put a little bit of effort into localizing the problem and clarifying just the things that are not right. We're not gonna talk about that much more, but it's something to keep in mind. Now, as a way of introducing some things that I often get a lot of questions about, I'm gonna talk about pipes and assignments. We'll put these into practice in a minute, but an assignment operator looks like this: a less-than symbol and a dash, and you can generate it on your keyboard in RStudio if you simply type Alt+dash. Now, we'll talk about this more, but you can read it mnemonically as "gets value from". So in R, you could type a very simple expression like five times five, and of course the answer to that would be 25, and then you could assign that answer to an object name. In this case, my object name is answer, and I might read this data sentence as saying "answer gets value from five times five". There are actually four other assignment operators in R, but by convention, for the most part, this is the operator that's used to assign the output of an expression to an object name. The other convention that's used a lot is the single equal sign, and, at least in the convention we're gonna learn about today, that one is used when you're assigning something inside of another function call, right? So here I'm using a mutate function: I'm taking the value of answer, which I assigned up here, multiplying it by two, so 25 times two is 50, and I'm assigning that to a new name called answer2.
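In code, the two conventions from the slide look roughly like this (answer2 is the slide's example name):

```r
library(dplyr)

# The arrow assigns at the top level: "answer gets value from 5 * 5"
answer <- 5 * 5                      # 25

# Inside a function call such as mutate(), a single = does the naming
df <- tibble(answer = answer) %>%
  mutate(answer2 = answer * 2)       # 25 * 2 = 50
```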
So: mutate a new variable called answer2, which gets value from answer times two, all right? That's the takeaway right there. This Alt+dash arrow is what's called the assignment operator. You're gonna see it more and more today, so if it doesn't make sense right now, don't worry. The other thing you're gonna see a lot of is this thing called a pipe, which is a conjunction that essentially allows you to compose a data sentence. Some people will call this whole thing a pipe; I tend to call it a data sentence. You can generate it with Cmd+Shift+M or Ctrl+Shift+M, or you can just type it out by hand: percent, greater-than, percent. You can mnemonically think of it as saying "and then". So: answer, and then take the square root of answer. And notice that on the previous slide we had already assigned answer the value 25, so if we took the square root of that, the console would respond with the value five, right? If we were also going to assign this to an object name, the console would not respond. Hopefully you'll see that more, but notice: here we're assigning the output to an object name, so it just goes into the object name; here we're not assigning the output to anything, so the answer is going to appear. All right, you'll see more of that. A couple of definitions to help us set the groundwork for what we're gonna talk about. R is a data-first programming language. It has a mature sense of the data lifecycle, and it really helps if you're trying to have a reproducible workflow. It's not the only tool that can do that, but as a data-first programming language I wanna contrast it with Python. Now, largely, Python and R do the same thing; there's almost nothing you can do in Python that you can't do in R, or vice versa. The functional difference is that Python is more of a general programming language, whereas R is more of an analytical, data-first programming language.
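A small sketch of the pipe behavior just described:

```r
library(dplyr)   # re-exports the %>% pipe from the magrittr package

answer <- 25

# Read %>% as "and then": take answer, and then take its square root.
# Unassigned, the result (5) prints at the console:
answer %>% sqrt()

# Assigned to an object name, nothing prints; the value goes into root:
root <- answer %>% sqrt()
```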
I think R fits a little bit better into the academy, but either tool would be fine. Another way to think of it: if you're analyzing numbers, doing some kind of statistical or other analysis and generating reports, R is great for that. Python might have an advantage if you're making phone apps or something like that. Actually, I've never heard of anybody making a phone app in R; I'm sure you could do it, but it's not really the workflow that naturally flows from R. There is, however, a tool that's really good for making web applications that works well with R. It's called Shiny. We're not gonna talk about that today, but you'll at least have heard the word here. So: R is the programming language, or the language interpreter. RStudio, the tool we're gonna use to work with R, is what's called an IDE, or integrated development environment. It's just a mask, an application that sits on top of R, that makes it easier to program. And then we're gonna learn a particular dialect of R today called the tidyverse. It's one of many ways to use R; you don't have to use it, and many people just prefer what's called base R. The tidyverse is one of the more modern approaches to using R. What it really is, is a collection of packages that all work well together and are all documented in a similar fashion, so that if you know one part of the tidyverse, it's a little bit easier to get up to speed with another package in the tidyverse. Now, you'll notice the definition here says it's an opinionated system of packages. The tidyverse people recognize that they have an opinion, and they're saying: this is a nice, easy, consistent way to use R; this is our opinion; if you don't like our opinion, feel free to use a different approach. But today I'm gonna show you the tidyverse. I think it's easier to learn.
I also think it's a little more modern and leans a little more into a data science approach than a pure statistical approach, and that makes it useful for generating reports and slides and websites and ebooks and all kinds of stuff. A really nice approach. But if you're very comfortable with base R and just looking for a little refresher, you may still learn some new things here. What you will not get from me is a statement that says you must use the tidyverse. That's just what I'm gonna teach; if you're comfortable with base R and you wanna stick with that, that's all good as far as I'm concerned. Now, that phrase tidyverse comes about because of a paper; there's a link to that paper right here, by a guy named Hadley Wickham, who is the Chief Scientist at RStudio and, I think, formerly a stats professor. He wrote this paper, called "Tidy Data", which lays out some very specific concepts about how you lay out your data for easy iteration and to imply and embed some grammatical meaning in your dataset. It basically boils down to: every column is a variable, every row is an observation, and the values are the intersections thereof. And it tends to mean that you have what's called tall data rather than wide data. Now, if anybody here is coming at this from the perspective of Python, you may have heard of the package called pandas, which comes out of this concept called panel data. Either one is fine, but in a tidyverse context we're just gonna lean into the tidy, tall, long data approach rather than the wide data approach. But there are tools to pivot your data either way, because one size doesn't fit all, and even the tidyverse recognizes that, right? Tidy data is a nice container for your data most of the time; it's very flexible and can be very easy to iterate over. But if you need wide data, then we're gonna make wide data.
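Those pivoting tools live in the tidyr package; here is a minimal sketch using made-up data:

```r
library(tidyr)

# Wide data: one row per country, one column per year
wide <- tibble::tribble(
  ~country, ~`1999`, ~`2000`,
  "A",          100,     110,
  "B",          200,     180
)

# Tall ("tidy") data: every column a variable, every row an observation
tall <- wide %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year", values_to = "cases")

# And back to wide again, because one size doesn't fit all
wide_again <- tall %>%
  pivot_wider(names_from = year, values_from = cases)
```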
If you need to push your data into a database, a relational database management system, then we're gonna enable you to do that. If you wanna read more about that, you can. All right, I'm just about ready to dive in, but I wanna clarify what we're gonna talk about today: a little bit about reproducibility, RStudio projects and literate coding, and then the guts of it, which we're gonna spend the rest of today on, is subsetting data and those five dplyr verbs, things like select, filter, mutate, group_by and summarize, that kind of stuff. So we'll come back to that. Just one more definition: reproducibility is obtaining consistent computational results using the same input data, computational steps, methods, code and conditions of analysis. This is an increasingly important approach as we continue on this long journey of the computer revolution, where we find we're all using computers all the time. You really want to avoid those situations where you've done a project, and then you look back at it six months later and you go, I have no idea how I generated all those formulas. That is actually a pretty common reaction, particularly among people who are using Excel, which was a great, innovative tool when it first came out and is still a very useful tool. But Excel was developed to make it really easy to kind of bumble through what you want to do; it was never really designed to make it clear how you did what you did. And that becomes problematic in research, because at some point someone's gonna ask you, essentially, to show your work. If you're doing your work in a reproducible fashion, it's super simple to show your work: it just becomes your workflow, and then you can point to it.
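As a preview, here are the five verbs chained into one data sentence, using dplyr's built-in starwars dataset (the grouping and summary choices are illustrative):

```r
library(dplyr)

result <- starwars %>%
  select(name, species, height, mass) %>%    # pick columns
  filter(!is.na(mass)) %>%                   # pick rows
  mutate(bmi = mass / (height / 100)^2) %>%  # derive a new variable
  group_by(species) %>%                      # split into groups
  summarize(mean_bmi = mean(bmi, na.rm = TRUE),
            n = n())                         # collapse each group
```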
So, in order for us to do reproducible work, I'm gonna introduce this idea of RStudio projects. In the upper right-hand corner of RStudio, as you'll see in just a minute, you can create new projects: a discrete folder for every project you're working on. It just makes things easier to share; you can share that one folder with somebody else, and neither of you has to rewrite any code in order to run that RStudio project. For those of you who are used to base R, this means no longer using setwd(), which is not a reproducible process at all; it makes things very tied to your particular file system. So I'm gonna recommend against using setwd(), and I'm gonna recommend against using rm(list = ls()). I'm not gonna define those because they're not super important, but if you use those, if that's part of your workflow, let me help you move beyond that. Another nice thing about RStudio is that it integrates well with Git, which is used for version control. I have a workshop on that; we won't talk about it much today. Charlie writes: John, I am right at the beginning of this; is there an online glossary of these terms? Well, I just defined some, and they're in the slides, which I'm gonna share with you, but if you wanna just put in a word that I've mentioned that's not clear, I would be more than happy to give you another definition. The other thing I would say is, for many things I'm gonna introduce today, we're just at an introductory phase; I usually have a more in-depth workshop and/or video that covers these concepts more clearly or more holistically. So just let me know how I can help you. Right, so moving beyond this concept of reproducibility and projects. Okay, got you. A way that you create reproducible code is by using this concept called literate coding.
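A tiny sketch of the difference (the paths and file name here are hypothetical, not from the workshop):

```r
# Fragile and machine-specific -- avoid:
#   setwd("C:/Users/me/Documents/hello_world")
#   rm(list = ls())

# Portable: inside an RStudio project the working directory is the
# project root, so relative paths travel with the folder
rel_path <- file.path("data", "starwars_small.csv")
# my_data <- readr::read_csv(rel_path)   # would read the project's data

# To "reset", restart R (Session > Restart R) rather than rm(list = ls()),
# so the script has to run cleanly from the top every time.
```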
So, in the Python world you may be familiar with Jupyter notebooks; that's an example of literate coding. In the R world you may have heard of R Markdown or R Notebooks, also literate coding. In reality, both of those notebook systems are multilingual. You can write R code in Jupyter; in fact, Jupyter stands for Julia (which is a programming language), Python and R. And you can use R Markdown to run R, Python and SQL. So you don't have to use Jupyter for Python, or R Markdown for R; all I'm saying is those are two notebook examples of literate coding. And the reason you use these notebooks is that a notebook becomes a compendium record of your work, where you can integrate your prose, your natural language, with your actual analysis code. I'll show you an example in just a minute. But what it means is you can more fully explain what you're doing. You can even write executive summaries and generate different kinds of reports. So I hope to make that a little bit clearer. And we do all this because literate coding techniques, within RStudio projects, with version control, enable workflows that are reproducible. They're easy to share through Git, and they decrease your dependency on using the mouse excessively, because that cut-and-paste process, while super handy, is not reproducible. It's really hard to document: well, I cleaned my data in Excel, and then I pasted it over to maybe Tableau, and then I created a Tableau output and I copied it and pasted it into an Adobe product, and I cleaned up my chart over in Adobe. Every time you're cutting and pasting, you're effectively breaking a reproducible chain. And then, when you look back at that, you are your own biggest reproducibility client. In other words, of course you wanna share your work and be transparent and clear about it, but the person who's most likely to wanna know the details is you, because once you're separated by time, you're gonna forget stuff.
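A sketch of what literate coding looks like in an R Markdown notebook: prose and a runnable code chunk living in the same document (the title and chunk contents are illustrative).

````markdown
---
title: "My analysis notebook"
output: html_notebook
---

This prose explains, in natural language, *why* the next step happens,
rather than burying the reasoning in cryptic `#` comments.

```{r}
library(dplyr)
starwars %>% count(species, sort = TRUE)
```
````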
So we're gonna use what are called R Markdown files to demonstrate a lot of what we just talked about. Let me do a quick demo right now, and then we will get into the guts of doing more with R, but I wanna cover a couple of things first: projects, ingesting data and R Markdown files. What I'm gonna do is share my screen. I'm connected to a remote computer; Duke gives everybody at least one free remote Windows computer, or a Linux computer if you prefer. I like to do this demonstration on a computer like that because it's clean and not full of all my little preferences, and you may see stumbling blocks that you are also faced with. So, you will have already installed R, and I'm gonna click on this RStudio button, and you'll see, if you haven't seen this before, the way RStudio works: it initially presents three quadrants. This is the environment panel. This is the files panel, which is just another view of your file system, very similar to Windows: here's my file finder, where I can go from pictures to downloads, whatever; I can navigate all of that here. I can add packages to this R installation via the Packages tab. I think I've already added this, but for example, there's a package called skimr that I can click on and just click Install, and that will automatically install the skimr package, and I'll get some information over here in what's called the console, which is this part right here. Now, the console is effectively direct access to R, which you can think of as just a big calculator, right? Like any computer. So I can type, for example, five times five and get a response, and the response (I'll make this font bigger in a minute so you can see) says 25. And if I decided to use my assignment operator, I could type answer, creating an object name, and hold down my Alt... let's see, why did that not work?
There it is: Alt+dash, which types out my assignment operator. Then any kind of expression I want, say 10 times 33, right? And then over here in the environment area, it's gonna tell me what objects I have in my environment, and it'll give me a glimpse of the initial value of those objects. In this case it's telling me the entire value of that object because it's very simple: it's just 330. But I could create something else. I'm gonna call it length_dimension, and I'm gonna make that an array of numbers, a sequence from two through 500, by fives. Let's see what happens there. Now I've got this length_dimension variable, and you can see that it's telling me it's a numeric vector that has a hundred elements, and then it's giving me the list of elements. If I wanted to see the value of either of those two things in my console, I could just type its name. So answer equals 330; length_dimension, and that's tab completion, so I get that. And then I can demonstrate one of the nice features of R: it enables something called vectorized math. I can take answer and multiply it by length_dimension, and it's going to multiply 330 by two, and then 330 by seven, and then 330 by 12, and it's gonna work its way through the whole list. And if I wanted to, I could assign that to a new object name. Now, typically people will not do all of their work in the console, because they want to keep track of their work and refer back to it. So let me set something up here. First off, I'm going to demonstrate the idea of projects. I want to start a new project, so I'm going to click on New Project. I don't need to save this, so I'll just click Don't Save. And by the way, it says don't save .RData there.
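The console steps from this part of the demo, written out:

```r
answer <- 10 * 33                  # 330

# seq() builds the numeric vector 2, 7, 12, ..., 497
length_dimension <- seq(2, 500, by = 5)
length(length_dimension)           # 100 elements

# Vectorized math: one expression multiplies every element
product <- answer * length_dimension
head(product)                      # 660, 2310, 3960, ...
```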
For those of you who are a little farther along, I recommend (this is not super important for people seeing this for the first time) that you go under Global Options and uncheck "Restore most recently opened project at startup", uncheck "Restore previously open source documents", uncheck "Restore .RData into workspace at startup", and change "Save workspace to .RData on exit" to "Never". That's what I do, because if it's doing all those restorations, you're setting yourself up for missing steps that need to be reproducible, right? You want to create a situation where the script, from beginning to end, always generates your output. Ask me more about that if you're interested. But let me create a new project; it gives me some options. I could grab a GitHub repository, but I'm just going to do a new directory; or I could turn an existing folder on my file system into an R project. I'm going to do a new project, and I'm going to call it hello_world; I'm going to give it an underscore there. And if you haven't clicked on this Browse before: a lot of times you have to find a root project directory where you want to put stuff. I've got a little tilde there, which is an abbreviation for the home directory, but if I click on Browse, just so you can see, that should be, in my case, the Documents directory where my projects go. So I'll click on Create Project, and it'll do a little bit of churning. Now this is my set-up project. I'm going to minimize this for just a minute and do a little bit of background setup: I'm going to go into my Documents folder, where I have a hello world that I used before with some data in it, so I'm going to copy that data and put it into my hello_world. Let's dig into that so you can see what's there. There's a small comma-separated values file in there, and if I click Open, you can see the comma-separated values: the variable titles are the first line, and then this comes from a dataset called Star Wars.
So, under name, the first observation is Luke Skywalker, and Luke Skywalker is 172 centimeters tall, weighs 77 kilograms and has blond hair. All right, that's my dataset that I just moved into a folder inside of my RStudio project. So if I want to open my RStudio project, I can click on this little link right here, hello_world, which is an .Rproj file, and that'll launch me into RStudio. You can go back and forth either way. I think I may have just created a second instance; I did, so I'm going to close one of these because I don't need two. But there's that same data, right? Now, a stumbling block for a lot of people, and it's super important, is: how do I ingest the data that I want to work with? So let me note that there's a button up here that says Import Dataset, and if I click on that, I can import SPSS data, SAS data, Stata data, Excel data, or there are two different ways to import text data. A CSV file is text data. By the way, I always choose the second option; there's the base R option, but I prefer not to use it because it feels a little old school to me. So I'm going to click on the second one, and it throws me into a data wizard, which makes its best guess. I need to browse to my data, that small file, and there it shows me a preview of the data. It tells me the data type of the vectors: I have a character vector, some floating point (double) vectors and another character vector. I can change these if I wanted to, but it has to make sense, right? This double could be integer; it wouldn't make any difference. I'm going to change it back to double. And then what happens is all of that code gets written down right here, right?
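The code the wizard writes is essentially a readr::read_csv() call. Here is a self-contained sketch (it writes a two-row sample first, since the demo file itself isn't shared):

```r
library(readr)

# Stand-in for the demo file, so this example runs on its own
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,height,mass,hair_color",
             "Luke Skywalker,172,77,blond",
             "C-3PO,167,75,NA"), csv_path)

# Roughly the shape of the wizard's generated code: col_types pins down
# the guessed column types, which keeps the import reproducible
starwars_small <- read_csv(csv_path,
  col_types = cols(
    name       = col_character(),
    height     = col_double(),
    mass       = col_double(),
    hair_color = col_character()
  ))
```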
So if, for example, I needed to skip 10 lines of the dataset, I could just put that into the Skip field, and when I hit Tab it writes the code for me over here, and then I can use that generated code in my script. So let's just cancel that and come back to it. Let's talk about scripts, specifically R Markdown notebooks. The easiest way to get started is to click on this little down arrow and choose R Markdown or R Notebook. R Notebooks and R Markdown documents are very similar: the notebook is mostly for development, while R Markdown is more of a production orientation, like, now that I know what analysis I want to do, I want to generate slides, or a dashboard, or an ebook. You're going to do a lot of that from R Markdown, but if you're just getting started, just developing your analysis, an easy enough place to start is with this thing called an R Notebook. That divides my screen into quadrants; now I have four quadrants, and the top quadrant here is my R Markdown notebook, right? And to be honest with you, I no longer really need my console anymore, so I have a habit of just minimizing it. And then... a question from the audience: if you start off with a notebook, how easy is it to convert that to another R Markdown output? Great question, it's super simple. You can actually click on this down arrow right here and choose a different thing to generate: maybe a Word file, or a PDF, or HTML. That's going to change this line right here. This top section, these first few lines, is what's called a YAML header, just basic metadata about the document, and it also identifies the output type. So if I change this, and you can see this, if I change this to Knit to HTML, I'm going to hit cancel, it rewrote some of that, but now the output is an html_document. If I change it to Knit to PDF, it rewrites it again. I don't want to follow through with those; the point is it's writing code for me.
I'm going to get rid of all that and type, oh, shoot, I shouldn't have done that, and now I have to do it from memory: output: html_notebook. I think I have that right; it might have to be in quotes. I'm waiting for this Knit button to change on me, and it doesn't seem to, so I don't think it needs the quotes and I'm going to take them out. Other things I would put in here, by the way: I would put in author, John Little, and I would put in a date. You can put in a very simple date, like, today's the 15th, or you can even write in little bits of R code right there if you surround them in backticks and type an R function that gives you the system date. So that's just the header. Everything else that comes up when you open an R Notebook is information about how to use an R Notebook, okay? Because I've used these so much, I have all of this stuff memorized, so my first step, and I'm not recommending this to you, is to just delete all that and start with my empty notebook. I'm going to put that back and point out that there's more information about how to use R Markdown right there, including information about these things called code chunks. Remember I said literate coding was this business of interspersing natural language with analysis. So this is my analysis, this is an example of a code chunk, and the rest of this stuff is my natural language, where I'm writing out my actual words, not cryptic little comments preceded by hashtags. So this comes up every single time. There's also a help document right here, the R Markdown quick reference. If I click on that, it tells you how to make some of the structural and editor changes, for example how to make something bold or italic. Now at this point you might be thinking, oh my God, this is so 1970s.
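Putting those header pieces together, the YAML he describes might look something like this (the title is an assumption here; the backtick-r expression in the date field is inline R code that fills in the system date when the document is knit):

```yaml
---
title: "hello_world"
author: "John Little"
date: "`r Sys.Date()`"
output: html_notebook
---
```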
So let me not freak you out too much. There is this little icon right here, in newer versions of RStudio (this has been around for about a year now), where if you click on it, you get a visual editor like most of us are used to. And then I can do the kind of typesetting, I guess formatting, that most of us are used to: I can highlight the word chunk and make it bold, I can highlight the word placing and make it italic, and I can make something a link by clicking on wherever that link button is and putting in a link to something like the library site. Let's see, I have to put in a full link, https://library..., all right, click OK. So if you're just starting out, you might want to start out in the visual editor. I'll admit to you that, having used this so much, I'm not at the point where I prefer the visual editor; I prefer this old-style source editor. But look at what we just did: we surrounded the word chunk with double asterisks, and that made it bold. We surrounded the word placing with single asterisks; that makes it italic. And we wrapped some text in link syntax pointing to that location. All right, so that's some of my prose, and here's my analysis. This is just straight R code right here, with an onboard dataset called cars. I'm going to make a scatter plot with the plot function, and if I click on this green triangle, it generates an inline image, some analysis to put in a report, right? So normally, how would I do this? My title is going to be my first-level header, so I'm going to call that hello world, and then I'm going to put in a second-level header, Executive summary: this is an example of literate coding. I might bold that, and I might actually also make it a link to the Wikipedia page on literate programming; unfortunately, I don't have that URL memorized, so I'm not going to, but you could. So I've got my executive summary.
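In source form, the formatting just described looks like this (the URL is a placeholder, not the one from the demo):

```markdown
We surrounded the word **chunk** with double asterisks to make it bold,
the word *placing* with single asterisks to make it italic, and we made
a [link](https://example.com) with bracket-and-parentheses syntax.
```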
Another thing I would do is load my library packages. So I would create a code chunk by going up here to this thing and clicking the little R, so that I have an R code chunk. Everything you're about to see next is stuff you could type out by hand, or, if you read this information, there's a keyboard shortcut for it. Speaking of shortcuts, if you go up here under Help, Keyboard Shortcuts, right there, there are tons of them. So I'm going to load my package with a function called library, and the package I want to load is tidyverse. I can execute that code chunk, and it gives me some information about the tidyverse, which is really a mega-collection of several packages. When I load the tidyverse, it tells me it just loaded these eight packages and which versions. It also shows me the long-form way of referring to the filter function in the dplyr package, dplyr being part of the tidyverse. It's telling me that there are two functions called filter, and that it's preferring dplyr::filter and masking stats::filter. Just telling me stuff that might be nice to know, but I largely ignore it. And if I wanted to, I could use some of these chunk options to show output only, or show nothing and run the code; I could do some of that stuff, but you don't need to worry about it in too much detail. So, moving on, the next thing I'm going to do is load some data, import some data. Let's go back to that discussion of data importing. Even though I mentioned that you can use the data wizard up here, I usually do it from down here, because it's relative to my RStudio project, right? My RStudio project called hello_world. So I'm going to click on my data folder, left-click on the file that's in it, and click Import Dataset, and I'm going to get some code down here that I can paste into my script.
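A minimal sketch of that masking message, using the starwars dataset that ships with dplyr:

```r
# Both base R's stats package and dplyr define a function called filter().
# Loading dplyr masks stats::filter, so a bare filter() now means dplyr's.
library(dplyr)

# The long-form, namespace-qualified call works regardless of load order:
tall <- dplyr::filter(starwars, height > 200)
head(tall$name)
```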
Really, I just did that so I could get this generated code, and I'm going to copy it into a buffer; I don't want to have to type all that out, and I don't want to have to memorize it. I'm just going to name the object small_starwars, put in my assignment operator, "gets value from," and then paste the code that I just had the machine generate for me, which uses the read_csv function and a relative file path, right? So I don't have, for example, C:\Users\John\special_project. If I had all of that stuff there, it would work, presuming it matches my file system, but if I share it with you, it's not going to work, because you don't have the idiosyncratic nature of my file system. So it's really better. That's one of the reasons to use the RStudio project and the data wizard: they're writing code for you that will work for other people, assuming they also have RStudio installed on their system. Okay, and then I'm going to add another code chunk. It says down here that you can insert a code chunk with Ctrl+Alt+I, which is the same as going up here. And I want to have a look at my dataset. So I might, you know, choose to put some text in here: "This is a subset of the dplyr::starwars data; have a look." And my executive summary may also include something like: characters tend to have more mass as they correspondingly gain height. That's a strange sentence, but hopefully you understand what I'm saying. So here I'm going to, well, this would get a little complicated; maybe I'll just put this lower and not complicate my executive summary.
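The wizard's generated import looks something like this sketch. It's made self-contained by writing a tiny CSV first; the file name and columns are illustrative, assumed from the demo, not the actual workshop file:

```r
library(readr)

# Simulate the project's CSV so this sketch runs anywhere.
# In the real project the wizard writes a relative path, e.g.:
#   small_starwars <- read_csv("data/small_starwars.csv")
csv_path <- file.path(tempdir(), "small_starwars.csv")
writeLines(c("name,height,mass,hair_color",
             "Luke Skywalker,172,77,blond"),
           csv_path)

small_starwars <- read_csv(csv_path, show_col_types = FALSE)
small_starwars
```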
I'll put in the visualization, and I'll put it here because it'll follow in a linear order. So here I'm going to say small_starwars; let's see what we get so far. Let's go up here and click Run All, and there's my data frame. And to generate my visualization, we're not going to get to all of this today; I'm going to show you a quick visualization and we'll talk about visualizations more tomorrow. So let's run it all again, and now I have sort of the initial guts of my report, and I'm going to click Knit. It asks me to give it a file name, so I'm going to call it draft report and click Save, and when I click Save, it puts that file right there. And you'll notice that this button went back to Preview; that's all controlled by the output line. So now I have my report, and for reproducibility points, I can click Restart R and Run All Chunks. That clears out the environment pane, clears out all of these outputs, and then runs everything. And if it runs from beginning to end, it also generates a derivative report, the one identified up here in line five, and then I can share that report with anybody. If I want to share it as an MS Word file, I can share it as an MS Word file. It would require a little more manipulation and formatting to be a pretty Word document, but nonetheless, I now have two derivative reports that I can share with other people, and I can go up here and choose which one to knit at which time. All right, so that process of rendering a report is often referred to as knitting. So that's the very basic setup of how you use RStudio to generate code that goes inside, at least in my case, since I'm using an R Markdown notebook, the code chunks. And I can run them out of order if I want to, but it's easy enough to run them all from beginning to end with just Run All.
Okay, so let me check back with my slides here; we'll get into the real manipulation. We've learned how to import data, and a little bit about RStudio projects, literate coding, importing data, and manipulating data. But let's learn more about manipulating data, because in reality, for any kind of data project, and this was written up some time ago, four, five, six, seven years ago now, someone estimated that 80% of it, give or take, is just data manipulation. We all want the output of that statistical model, or we want to see the pretty graph, and we will spend time on those things, but you can't do any of that unless you have data that's well-formed, and getting your data, which in the real world is typically messy, into a well-formed shape takes a lot of wrangling and a lot of time. That's what we're going to learn next: the dplyr verbs for wrangling data. Okay, so I want to go back to my slides, if I can find them, and bring you all with me, and that brings us to dplyr. What I'm going to do, probably at this point, is open up the floor for some specific questions from people who did the advance work, and then I'm going to explain dplyr and the data wrangling that dplyr enables with new data that you haven't seen before; I'll also show you how to download all of the stuff we're going to do. But I'd like to gently pause first and ask, for people who did the prep work, and it's fine with me if you didn't, but I'll privilege those who did: do you have any questions, specifically on material you worked through that doesn't make sense, that could help me focus what I'm going to say now? Audience: Could you go a little bit into how to link the files from the GitHub that you posted, and actually bringing those in? I was unable to link directly out.
I was able to pull up the GitHub, but I ended up having to copy and paste into a new document that I created in R. So I know there's probably a better way to go about it, but I just couldn't figure it out. Yeah, I get it, and this is complicated a little bit by Git. So yes, let me go over that, because we're all going to do it anyway. Let me find my, where did I put my, there it is, here's my R screen. And let's do all this together. We're going to download these two things, this and this. So I went to this URL, my GitHub repository, Rfun flipped, and I'm looking for this green button, which brings up a context menu. Now, for those of you who have Git installed, you'll remember when I showed RStudio projects that you could use version-control projects; this is the information you would paste in. But not everybody has Git installed, and we're still introductory, so even though it's slower, let's do the always-works method, which is to click on this Download ZIP button. That downloads it into wherever you have set as your download directory. I'm going to do the same thing with the other repository; these are the exercises I'd like to cover in the remainder of today. So I'm going to go here, click on the green button, and click Download ZIP. Now, in Windows and Chrome, it's easy enough to click on one of these context menus and get the option that says Show in Folder. The exact steps at this point vary a bit: wherever your browser downloads go, go there. And what you're going to find, well, you can see that I have several copies of this, is that you have the two things we just downloaded, each a zip-compressed folder.
Now, you can double-click and look into a zip-compressed folder without uncompressing it, but I want you to expand it, because we may save back into it. On Windows, the way to do that, as long as you don't have any extra software installed, is to right-click on it and choose Extract All. So I'm going to put this on my desktop where I can find it: desktop. And on Macs, I think you can just double-click on it. But I want to stress that it's really important to expand these folders; don't just look into them. If you don't expand them, you won't be able to write back into them. So, Extract All. Oh, I put that in a, shoot, I didn't mean to put it there. I'll do it one more time, or I'll just move this to my desktop. Okay. Now let me minimize everything, and I should have these two folders right here on my desktop. If I look at the Rfun one, remember I said I put up all the slides: you've now downloaded the slides, and you've got all the exercises that you may want to work through. They include answers; some people like to work with the answers first, some people like to challenge themselves and try to figure out the answers. Either way is good with me. In slides, there are two PDF versions of the slides and there's an HTML version. Oh, well, there was. Oh, right there, slides part one; it's got the little Chrome HTML extension. But we're going to go into, let's also double-check here, exercises, because what I want to do is open up this project. I want to find the .Rproj file. Some of you won't have the view that I have; let's do medium icons, so it looks like that. That's an .Rproj file; it says intro_to_r, underscore, exercises, and if you hover over it, it says it's an R project. If you double-click on it, it launches into RStudio as a new project. Now, these are stumbling blocks, things that I just covered.
So, the question, and I'm sorry, I don't remember who asked it: if I didn't fully answer the steps for downloading the GitHub material, please let me know. Okay, if there are other questions, let's hear them, but what I'd like to propose, unless you unmute or chat in, is that I talk about dplyr. You should at this point see on your screen a slide that says dplyr, a grammar of data manipulation. dplyr is one of the, there have got to be at least eight, packages that get loaded when you load the tidyverse library, and it's all about data wrangling. Tomorrow we'll talk about pivoting longer and pivoting wider; that's an example of data wrangling, but there are steps before that you want to know about. The other thing to point out is that with tidyverse packages, if you know a tidyverse package, dplyr, or ggplot2, or purrr, you know, the list goes on, you can almost always find documentation with this format: dplyr.tidyverse.org. So if I click on that, the nice thing about this documentation is that it's online and easier to read, even though you can get it on board in your RStudio. The other nice thing is that tidyverse documentation is pretty much consistent across packages. You can click on this link called Reference and you'll get a list of all the functions in that package; so if I want to know more about the nest_join function, I can read more, and there are usually examples at the bottom. Okay, going back to my slides. Here is what I refer to as the five dplyr verbs, even though in this table I actually have six. The first three we're going to cover subset a data frame, which is a rectangular grid that looks a little bit like an Excel spreadsheet, just rectangular data, a data frame made up of vectors of information. We can use the filter command to subset by rows, which makes the data frame smaller.
The select command subsets by columns, and we can use the arrange command to order the rows by a variable in a column; I'll show you a picture of that in just a minute. We also have a function called mutate and a function called summarize, and we'll talk about those in a minute. But filter looks a little bit like this, right? If you have loaded the tidyverse in your R session and you paste this code, it will work. It takes the starwars data frame and then filters on the variable eye_color. Notice there's a double equal sign there, for equivalence testing, where eye_color, which is a character vector, has the value orange. Because you can have different eye colors, right? You can have blue, brown, whatever; we just want to see the rows where eye_color equals orange. That's how you would write that, and that code will work in your console or in your script right now. And so visually, the dark top row is the table header, and then it's a four-row dataset. Pick a column, maybe this one is eye_color, and ask where eye_color equals orange; maybe that's just these two rows. Then we subset that data frame, going from a four-row data frame to a two-row data frame. Similarly, we can use the select verb to subset by columns, and there are a lot of helper functions to enable sophisticated column selection. But very simply, you can identify a column by its column name or by its position, or even create a, I don't know what word I'm searching for here, a range of columns, I guess, is what I'm looking for. So this is saying select columns two through four. Or you can combine these, right? I want to select columns name through mass, where maybe there's a variable called height in between those two, and I also want to select column 10, and column seven, and columns four through six. You can combine all of those.
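As a sketch, the filter and select calls just described look like this (starwars ships with dplyr, and in its layout name through mass are columns one through three):

```r
library(dplyr)

# filter() subsets rows: keep only the characters with orange eyes
orange_eyed <- filter(starwars, eye_color == "orange")

# select() subsets columns: by name, by position, or by range
select(starwars, name, mass)              # two columns, by name
select(starwars, 2:4)                     # columns two through four
select(starwars, name:mass, 10, 7, 4:6)   # ranges and positions combined
```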
It depends on your purposes, of course, but in the visual, we're going from a four-column data frame to a two-column data frame. And then arrange is a way to sort, right? You can arrange by eye_color, and if eye_color is an alphabetic character vector column, with values like blue and brown and yellow and orange, it's going to sort in ascending alphabetical order, A through Z. But if you wrap eye_color in the desc() function, you get reverse alphabetical order. And if you have a numeric column, it's the same thing; in the background it's sorting by, I don't know, UTF codes or something we don't want to worry about too much. But if it's the mass column, with numbers measured in kilograms, then by default the character with a mass of one kilogram is listed first, and the last character, in our case Jabba the Hutt, who weighs about 1,358 kilograms, is listed last. And if I wrap mass in desc(), then of course it reverses the order, right? Let's take a look at those again. Please unmute or ask a question or throw it in the chat if that's not clear, but let me try and give you a demo. You should now see my RStudio client, and I'm going to make some more changes to make this easier to see. The first thing I'm going to do is open up this file called 01a dplyr, and I'm going to choose the answers version so that I don't have to type so much. If you want to challenge yourself, today or later, start off with the version without answers and see if you can answer the questions without the prompts. So I'm going to click on that, and it opens in my editor. The other thing I'm going to do, just to make all of this easier to see, is make some changes that I don't think you should have to make.
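The sorting behavior described above, as runnable dplyr code:

```r
library(dplyr)

arrange(starwars, eye_color)        # ascending, A through Z
arrange(starwars, desc(eye_color))  # reverse alphabetical

arrange(starwars, mass)             # lightest character first
arrange(starwars, desc(mass))       # Jabba, the heaviest, first
```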
I'm going to change the font size to 150%, and I'm going to make the editor the only thing you see; if that's not large enough, let me know and I'll make it larger. But I'm going to start off here by running the first code chunk, which loads library packages. And you'll notice at line 23 that the first thing that happens is it installs a package called gapminder. I don't usually put install functions in my code chunks, because you only have to install a package once. So if you know a little more about R and you have a script with install calls in it, you can comment those out: you have to load a package every time, but you only have to install it once, so it's just taking up time if your script runs it constantly. But in this case, since I don't know what setups y'all are coming from, go ahead and install it; that's why it's there on line 23. Now that I have that: the gapminder package has a dataset in it called gapminder, and if I write that out long form, it looks like this, a data frame of 1,704 rows and six columns. If I highlight that and hit Ctrl+Enter, I can see it that way. I can scroll through the first 1,000 rows, ten rows to a screen, and I can scroll left and right; if it were taking up more of my screen, there would be little arrows letting me go back and forth. But I'm using the glimpse command here, and, oh, notice that it also says right there, 1,704 by six. The glimpse command is generally a good way to look at particularly large datasets: it kind of turns your data frame on edge, and it shows you things that are easier to see. It gives me the names of my six columns down the first column, and then their data types. So I have two factor data types, which are for categorical data, and an integer data type for year.
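For reference, the long-form call and the glimpse he runs look like this (the gapminder package has to be installed first, e.g. with install.packages("gapminder")):

```r
library(dplyr)
library(gapminder)

gapminder            # prints the 1,704 x 6 tibble, ten rows at a time

# glimpse() turns the data frame on its edge: one line per column,
# showing the column name, its data type, and a preview of the values
glimpse(gapminder)
```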
I have a double, a floating-point variable, called lifeExp, for life expectancy. I have another integer, pop, for population, and another double, gdpPercap, for GDP per capita, right? So I'll oftentimes use glimpse just to get a sense of what my data is, and you can see there's a little preview of each vector. And that's the thing about a data frame: the vectors in the data frame each have to have a consistent data type, right? So, for example, where this says 1952, I can't have the word John as one of the elements in year. It's just not going to work. But moving on to those first three verbs we just talked about, select, filter, and arrange. The goal is: let's subset the gapminder data frame so that it only displays the year and population. So I'm using select and putting in year and pop, right? Notice that if I just run line 48 without the pipe, Ctrl+Enter, I'm creating a subset of year and pop, and I can do it that way, and that gives me 1,704 rows by two columns. Now, that does not permanently change the gapminder data frame, right? I can type gapminder here, and if I run both of these together, I've got two different results, but the difference is that while the first one is subsetted to display only two columns, if I go back and look at gapminder, it's still a six-column data frame. So if I want to make this permanent, I've got to assign it to an object name. I'm going to call it gap_small, and I'm going to put in that assignment operator, and then if I want to look at it, I have to call it again, gap_small. So I can run this code chunk, and now, because I have something called gap_small, I can refer to it again and again and again, and I still have not changed the original data that came to me. All right, so that's select, subsetting by columns. Similarly, you can subset by rows; in this case, I'm saying where year equals 1952.
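The point about assignment can be sketched like this: subsetting alone is just printed, and only the assigned object keeps the result (gap_small is the object name from the demo):

```r
library(dplyr)
library(gapminder)

gap_small <- gapminder %>%
  select(year, pop)      # keep just two of the six columns

gap_small                # 1,704 rows by 2 columns
ncol(gapminder)          # still 6 -- the original data is unchanged
```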
And you'll notice, as you scroll through this data frame, that 1952 occurs quite often, because this is population by country, measured, I think, every five years, right? So actually I'm going to filter to where year, note the double equals for equivalence, equals 1987, and that turns the 1,704-row data frame into 142 rows. So I've subset by rows, and I still have all six columns. And of course, I can combine the two functions I've learned so far into a more complex data sentence: select country through year, plus pop. So this isn't doing much, but, right? I now have a smaller data frame; that's subsetting by column and by row. And then a lot of times you want to arrange, and we talked about that. You can arrange by population, and by default it sorts in ascending order, right? So if I run this, here I've got the lowest-population country first, which is a country in Africa, at least as it existed in 1952. And again, I've only got a thousand rows displayed here, but it goes up through Cameroon, which has a population of about 9.2 million. And then it's nice to know that you can sub-sort, right? So if I want to sort in descending order by continent and then, where there are ties, sub-sort in reverse numerical order, I can do it this way, and I get something like this, where Oceania, the last continent name in alphabetical order, comes first. Then within that I'm sorting by year: 2007 comes first, then 2002. But what if I wanted New Zealand to come first, right? I could further arrange this, arranging by country, and, oops, sorry, it would also have to be in reverse alphabetical order, and now New Zealand comes first. So anytime there's a tie, it moves on to the next argument. All right, so next up we have mutate and summarize. Well, let's do mutate first. First, let's take a quick look at the visual aid, in hopes that it makes things a little clearer.
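Those steps combine into one pipeline, something like this sketch:

```r
library(dplyr)
library(gapminder)

# Rows for one year, a few columns, sorted with a secondary key:
# descending continent first, then descending population within ties
gapminder %>%
  filter(year == 1987) %>%              # 142 rows, one per country
  select(country, continent, year, pop) %>%
  arrange(desc(continent), desc(pop))
```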
So mutate. Mutate is often described as generating a new variable, but you could also alter an existing variable with it. So if I take starwars and then mutate, I'm going to create a new variable called big_mass, where big_mass gets its value from the existing variable called mass, and this is my expression: mass times 100, right? So if this was a two-variable data frame and I run this, I now have a three-variable data frame with name, mass, and big_mass. And you can put all kinds of expressions in there, right? In this case, a mathematical expression, a BMI-style calculation: mass divided by the square of height in meters, that is, mass over (height / 100) squared. But you may also mutate character strings. And in that case, I'll just note that because R is a data-first programming language, it's sort of naturally easier to manipulate numbers than it is to manipulate text, but manipulating text is totally possible. It often means manipulating text with what are called regular expressions, which we're not going to talk about today, but that's what's happening here. I'm using functions from the stringr library, which is the tidyverse package for string manipulation, including regular expressions, though in this case it's doing something very simple: str_to_upper is making the hair_color variable all uppercase, so if blond was listed in lowercase, it makes all those letters uppercase. And then I'm using str_c, c for concatenate or combine, to make a nickname. So if hair_color was blond, this character would now have a nickname variable, "big BLOND," right? That's all that's saying. Now, that can be a little bit confusing, so if it doesn't make any sense to you, just ignore it. Let's have a look, going back, hopefully, to RStudio. Here we're using mutate.
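As code, those mutate examples might look like this sketch (big_mass and nickname are the demo's invented variable names):

```r
library(dplyr)
library(stringr)

starwars %>%
  mutate(
    big_mass = mass * 100,                # new variable from an expression
    bmi      = mass / (height / 100)^2,   # mass over height-in-meters squared
    nickname = str_c("big ", str_to_upper(hair_color))   # text manipulation
  ) %>%
  select(name, mass, big_mass, bmi, nickname)
```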
We're going to take the life expectancy column, lifeExp, multiply it by two, and wrap it in round(), which rounds off the answer. Let's go ahead and do two of these, just so we can see the difference: mutate(life_double = lifeExp * 2). So one of them is rounded and one of them is not, and there's the difference right there: that one's rounded to two digits, and this is the full number. Of course, I could give round() a digits argument that would let me change the number of decimal places in the value there. I could also, if I wanted to, combine that with select, of course, so that I can show country. So that's what mutate is doing. If for some strange reason I wanted to overwrite the value of lifeExp, I could just write another mutate function, take the original variable, and say that lifeExp equals lifeExp divided by five, oops, and then I'm just overwriting the original in a case like that, because I'm using the assignment operator here. So I really should read that as "gets value from," right? That's what mutate is doing. And that brings us to count, group_by, and summarize, which I'll try to introduce to you next, unless there are questions; back to my slides. All right, so, let's go back a couple of slides here. Summarize and group_by are really powerful, and they're basically used for getting column totals and column subtotals. But the syntax is a little funky, so I like to start by introducing count, which is really a specialized shortcut for summarize, right? I think it will help demonstrate what's happening here.
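The gapminder version of that demo, as a sketch (life_double and life_rounded are assumed variable names; round()'s second argument is the number of digits):

```r
library(dplyr)
library(gapminder)

gapminder %>%
  mutate(
    life_double  = lifeExp * 2,            # full-precision result
    life_rounded = round(lifeExp * 2, 2)   # same value, rounded to 2 digits
  ) %>%
  select(country, year, lifeExp, life_double, life_rounded)
```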
But if I had a variable called gender and it might have values like masculine and feminine, and I wanted to see how many rows in my data frame had the value masculine and how many had the value feminine, I could write that expression right there, that data sentence: starwars and then count gender. And it's gonna give me back a table of that count. So let's take a quick look at that. Right, where's count? Count's up here. So using this example, if I look at gapminder, I have a variable called country. And you can see this is such a big data frame with repeating data values, it's really kind of hard to tell what countries are listed other than Afghanistan, right? Well, I tend to just use count, but there is a more specialized function called distinct. If I type count and run this code chunk — and in this case it's a very clean data set — it's gonna give me information that says there are 12 rows of Afghanistan, 12 rows of Albania, 12 rows of whatever, and it reduced that data frame to 142 rows. Just to bring in starwars, that example we were looking at a minute ago: count gender, and I run both of those at the same time. In the case of the starwars one, I've got 17 characters that are feminine, 66 that are masculine, and I've got four NAs. By the way, in terms of dealing with NAs, you can do stuff like this: drop_na(gender), and then I'm limiting my count that way. I could also, of course, arrange this if I wanted — arrange in descending order by n, right? So that's count. That's an example of column subtotals, right? The column is gender, the values in gender are either masculine or feminine, and if I want a subtotal of how many are masculine or feminine, I just do that, right? So, back to the slides. summarize is really the function that count is built off of, and it's usually used in combination with group_by. But bear with me here for a minute.
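The count variations just mentioned can be sketched as follows — drop_na() lives in tidyr, which loads with the tidyverse:

```r
library(dplyr)
library(tidyr)   # for drop_na()

cnt <- starwars %>% count(gender)                # includes a row counting the NAs
cnt
starwars %>% drop_na(gender) %>% count(gender)   # drop the NA rows first
starwars %>% count(gender) %>% arrange(desc(n))  # biggest group on top
```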
Notice that I'm spelling summarise with an S. That's because R was developed in New Zealand, and they predominantly use British English spelling. But R supports both British English and American English spellings, so you can use either. And if you're lazy like me, in the typeahead buffer summarise with an S comes before summarize with a Z, so I tend to use what comes first. It's also handy if, for example, you're used to spelling colour with a U — you can spell colour with a U when you're using a function that has the colour argument. Okay, so in this case, I'm taking the starwars dataset and I'm getting some column totals on things like the minimum height and the maximum height and how many distinct entries of height there are, things of that nature. Let's take a look at that and then we'll introduce group_by. So here, there's my American English spelling. I take gapminder and I want to sum all of the values in population. Of course, this doesn't make much sense, because this is population for 1952 and population for 1957 and population for 1962, and accumulating these doesn't make much mathematical sense, but it demonstrates how you can sum a column, right? I can summarize pop by using the summarize function, creating a new variable name, sum_of_pop, which gets value from an R function called sum to sum pop, right? That's this right here. So if I do that, I get this really big number, which by the way is a little hard to read, but it's a double floating point data type, so you can do math on it. If you really needed to read it, there are different things you could do — for example, use a formatting function to make it easier to read — but then it's a character data type and you can no longer do math on it, or you'd have to convert it back to integer or numeric in order to do math.
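The column total just described might look like this — I'm using base R's format() with big.mark as one plausible "make it pretty" helper; the transcript doesn't say which formatting function was used:

```r
library(dplyr)
library(gapminder)

totals <- gapminder %>%
  summarise(sum_of_pop = sum(pop)) %>%                 # one-row grand total
  mutate(pretty = format(sum_of_pop, big.mark = ","))  # readable, but now character
totals
```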
But notice another difference here: in this case, I'm getting a column total, but what if I don't want a total of the whole column? What if I just want totals by each year, right? That's where group_by comes in. That's what count effectively does for me — it makes it so that I don't have to write the group_by function, but of course it's a specialized and therefore limited function. So let's take a look at this. I've got gapminder. I group by year — so this is my subgrouping — and then I get a subtotal for every one of these years on the variable population. That's what's happening there: group_by year, summarize total_pop, which is the sum of pop, right? And then I use that formatting function to make it pretty, so it's easier to read. I end up with a 12-row data frame where the last year there was data for was 2007, and it looks like there were 6.25 billion people in the world according to that data frame as of 2007. Right, so that is the bulk of the dplyr verbs. I'm not sure what I've forgotten to tell you and I'm not sure what you don't understand, but you've been very generous and quiet, so I want to open the floor to make sure you have a chance to clarify things that don't make sense. And while you're thinking of your question, I want to tell you two things. One is I have an evaluation survey that you might want to fill out, and I would be happy for your response. You'll get an email that has this link in it, but I'm going to put it in here. The other thing I'll tell you is what's going to happen tomorrow: we're going to talk about how to make visualizations, how to pivot data wider and longer, and how to do what are called left joins, which is to join two different data frames by some common value. And of course you can join data frames that don't have common values — that's a different thing. And we'll talk a little bit about how you build a simple linear regression model in R.
So I want to wait for questions, so I'm going to pause for a minute, and then I'll have some other comments. Does the order of your statements within a piped group of code matter? So for instance, in SQL you have to have your SELECT and then FROM and then GROUP BY. But I noticed in the example here, you specified the data, gapminder, then you grouped and then you did your aggregation. If you swapped lines 138 and 139, does it make any difference? group_by has to come before summarize, but that might be the only situation where order matters with respect to the other verbs that we talked about. So now that I've done summarize, I could still mutate, right? I could create big_pop equals total_pop times two. Well, that didn't work — oh, that's because of that fancy formatting thing I did. Oops. But it does matter with summarize, because in order to get subtotals, you have to tell summarize what the object is that you want to subtotal, right? Does that help? Yes, thank you. Okay. Yes. I have a question about that pipe function — that percent-greater-than-percent one. I'm kind of confused. Sometimes it seems to show some parallel relationship between the two lines; sometimes it seems like a level-by-level thing. So I'm kind of confused about that. So, okay, let me just ask you — are you saying, for example, that you can have two pipes on one line like that? Yeah, I think so. Because I'm still not very clear about that pipe function. Okay, good. Thanks for the question. Oh, and I see that some questions came in on chat, so I will get to those. So the pipe — the reason to go to new lines really is just for ease of reading, right? I mean, a general suggestion for when you write code is to try not to get longer than about 60 characters wide. That's a real loose rule of thumb, and if you violate that rule, you should not feel bad. But in my experience, code is generally easier to read if you can break it up.
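The answer to the ordering question can be sketched like this — group_by must precede summarise, but the pipeline can keep going afterwards:

```r
library(dplyr)
library(gapminder)

after_summary <- gapminder %>%
  group_by(year) %>%                  # must come before summarise ...
  summarise(total_pop = sum(pop)) %>% # ... to define what gets subtotaled
  mutate(big_pop = total_pop * 2)     # but you can keep piping afterwards
after_summary
```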
And now the pipe itself helps with that, but all the pipe is is a conjunction that — again, if you think in your head — says "and then". So let's read this whole thing as if it were a data sentence. It says gapminder and then group by and then summarize and then mutate. So you could do it all on one line and you would get exactly the same response, and the only difference would be that it would be a little bit harder to read. But those two statements, those two data sentences — the one from line 137 and the one that begins at 139 and runs through 142 — are gonna generate exactly the same output. So there's nothing technically wrong about this, and the pipe does not really privilege anything. It only serves as a conjunction, a way to manage the flow of your data sentence. I hope that answers your question. Thank you. And then whenever you want to — because it seems like in this notebook you can deal with more than just one dataset — but whenever you want to deal with one, you always start with that dataset name, right? Yeah. The general convention is to always start with the data frame. But yeah, absolutely, you can use other datasets. You can load multiple datasets. We'll deal with some of that tomorrow, but for example, here we've got two different datasets and three different data sentences in one code chunk. Okay, that's helpful. Thank you. Okay. Okay, I see Morad asked a question in chat that says: is it possible to make a dataset available for everyone by directly calling it in R, or would people need to download the data first? So the answer to that is: if the data are in a package that everybody can access, then everybody can access it, yes. But if it's data that's unique to you, they would need to load that themselves. So for example, if I just type data(), I'll get a list of all of these onboard datasets that either exist in base R or exist within packages that you have loaded, right?
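The one-line-versus-multi-line point can be demonstrated directly — the two forms are the same data sentence and produce identical output:

```r
library(dplyr)
library(gapminder)

# One line ...
a <- gapminder %>% group_by(year) %>% summarise(total_pop = sum(pop))

# ... and the same data sentence broken across lines for readability:
b <- gapminder %>%
  group_by(year) %>%
  summarise(total_pop = sum(pop))

identical(a, b)   # the pipe only manages flow; layout doesn't change the result
```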
That's what just happened when I typed data() — it threw me into this new tab. So all of those datasets are available if people have these packages loaded, right? The datasets package they may already have, and we all loaded tidyverse, which includes dplyr. So everybody's got band_instruments and band_members available to them — sorry, band_instruments and band_instruments2. And gss_cat. But if they don't have that, let me at least show you: you would either have to send them the data, or, if you have the data on something like GitHub, a lot of times you can read the data from GitHub. So let's look in this GitHub repository and read in this dataset right here. Let's take a look at it. You can see that it's a data frame of maybe four or five variables. It looks like it has something to skip — line one needs to be skipped. But let me click on Raw to get a URL, and I'm going to copy that URL. So essentially, if your data are available on the internet, then you wouldn't need to send them the data, because I could do something like this: food, a data frame, gets value from read_csv, and then put in that URL. And remember I said I needed to skip one line, so I'm going to put in the argument skip. And there — if you put that same thing into your script, you should now have that data frame. And so you could share it that way. It can be hosted at least on GitHub. I don't know if Dropbox and Box work that way or not. They might; I've not tried that. I hope that answers that question. Let me know if you have a follow-up. And I see that — please forgive me — Timmy Taiyo asked the question: assuming you wanted to replace the NA instead of dropping it, what's the code to replace it? So yeah, you can do that. There's a replace_na function. So, starwars. Let's subset it down so it's easy to look at: gender. So now we've got that. And let's do this.
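The read-from-GitHub step might look like this — the URL below is a placeholder, not the file from the session; substitute the "Raw" link for your own file:

```r
library(readr)

# Hypothetical URL -- click "Raw" on GitHub and copy that address for your own file
food <- read_csv(
  "https://raw.githubusercontent.com/USER/REPO/main/food.csv",
  skip = 1   # this particular file's first line is a note, not data
)
```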
First we'll say filter where is.na — just to see what we've got. That didn't work — oh wait, is.na(gender). So yeah, we've got a couple of NAs in there and maybe we want to replace them. We want to replace them with mutate. Let's just double check — oh, I was thinking there was a replace_na, but I'm not finding it at the moment. I think there's na_if, which does something different, right? na_if allows you to, for example — although you could do it this way — let me at least answer your question. I can say mutate: I'm going to call it new_gender, and it gets value from if_else. Well, actually I need a true/false there, and the easiest way to get a true/false is to say is.na(gender). So if that's true, then we're going to call it 999 — oh, actually, sorry, since this is a character column, that's going to have to be the character string "999" — and if it's false, we're just going to keep gender. So let's run that; you can see how that worked. Then let's comment out this filter statement and type count(new_gender). And we should see that we altered those NAs to 999 — it has a value, and there are four of them. Remember, work through those examples; it's a great way to learn. But I am more than happy to meet. About the library function again: you mentioned that for every package, you only need to install the package once for your RStudio, but you have to use that library function once for each notebook — or is it once per project? Not per project — per script, per notebook, right? So for every script, the way I do it is I usually have a load-packages statement at the top of each one of my scripts. So let's take a look at this 02 file. If I open that up, you'll notice I've got that statement right there. And in this EDA one, I've got three libraries loaded right there. Does that make sense? Yeah.
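Both approaches to the NA question can be sketched as follows. Note that tidyr (loaded with the tidyverse) does in fact ship a replace_na() helper, even though it didn't turn up during the live demo; the "999" and "none" replacement values are just illustrations:

```r
library(dplyr)
library(tidyr)

# The if_else() approach from the session: "999" must be a character string here,
# because gender is a character column
recoded <- starwars %>%
  select(name, gender) %>%
  mutate(new_gender = if_else(is.na(gender), "999", gender))
recoded %>% count(new_gender)

# tidyr's replace_na() does the same job directly:
starwars %>%
  select(name, gender) %>%
  mutate(gender = replace_na(gender, "none")) %>%
  count(gender)
```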
And so that's why you prefer to use a notebook rather than a script? So the value of the notebooks to me is the ability to do literate coding, where you can be much more expressive, using natural language to explain what's happening in the analysis around the code chunks. And that's also helpful because then I can, for example, generate slides — like I opened the session with, where all of those visualizations were generated from code chunks — but I can use words to explain the visualizations and keep all of that together. And that way, if the data changes, I don't have to rewrite the explanation; I don't have to paste it into Microsoft Word or a PDF. But I'm not gonna tell you not to use a script. If you prefer to use a script, you could just do it this way. You could library — let me spell it right — tidyverse, and then starwars. I'm just gonna make a very simple visualization here: ggplot, aes(height, mass), geom_point. Now I should be able to save that and call it example.R. Save. And it should show up right there. And if I run all of that — uh-oh, something didn't work. Why did that happen? Could not find function ggplot2. Oh — that's because the package is called ggplot2, but the function is just called ggplot. That makes sense. I'll put my comment there. I should be able to run that whole thing and get a visualization over here, right? So it's kind of a matter of personal preference. I would tell you that having to comment your code like this, putting messages to myself and others — that whole process, in that it's not really natural language and you have to use comments in front of things — tends to promote being cryptic, and it's not a great way to document your code. But it's not wrong, right? What I'm trying to teach is a reproducible method, and I would tell you that the notebook is a better way to do that.
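The plain-script alternative just demoed might look like this — saved as example.R, with comments standing in for the notebook's prose:

```r
# example.R -- a plain script works too; comments carry the narration
library(tidyverse)

starwars %>%
  ggplot(aes(height, mass)) +   # the package is ggplot2; the function is ggplot()
  geom_point()
```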
Yeah, and it seems like the script only works for R, right? But if you use the notebook, you can also insert other chunks, like from Stata or others, right? I don't know about Stata, but you can definitely have other kinds of chunks, right? There's an R chunk. If I had a Python interpreter set up, I could also include a Python chunk. And if I was grabbing data — maybe I would do that before these two manipulations; I'm trying to get my cursor to behave — I could also insert an SQL chunk and, you know, pull data from a remote database. So yeah, the notebooks are definitely handy in many ways. Okay, thank you. Yeah, sure thing. Okay. I'm interested in formatting data — normalizing data before you get to use it. Yeah. There's something about that. When do you — well, how do you determine if you have to normalize your data, and what are the dangers of doing that? Oh, well, it's a good question. Do you have an example of what kind of data you think you need to normalize? Okay, I was going to do a regression on like five variables, and one ranges up to 700 — the highest is 700 — and the others are like 0.1 and 1 and 2. Yeah. Okay, so I'll be honest: that is more of a statistical question for which I don't have a good answer. I know that you could, for example, scale the big numbers so that everything falls between the values of zero and one. So for example, let's take a look at this economics data frame. You'll see that there are some really big numbers — pop is way out of the norm of psavert, and unemploy is similar in scale to pce but much bigger than uempmed. And I'm going to show you economics_long. This is where someone has pivoted the data, but the point I want to show is that they also scaled the data, so that all of these numbers are now between zero and one. And at least from a visualization standpoint, that makes it much easier to visualize: ggplot, aes(date, value01), geom_point with aes(color = variable).
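The scaled plot just described can be sketched as follows — economics_long ships with ggplot2, and its value01 column holds each series rescaled to the 0-1 range:

```r
library(ggplot2)

# value01 is each variable rescaled to [0, 1],
# so all five series share one readable y axis
ggplot(economics_long, aes(date, value01)) +
  geom_point(aes(color = variable))
```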
Let's see if I did that right. Yeah. So now you can see that the scale here on the left is between zero and one. And if that wasn't normalized, it would be much harder to see common trends, because some of the numbers are so much larger. Oops. So let's just take a look at the difference here. Here's the normalized plot and here's the non-normalized plot. And the nuance of the data for many of these variables just gets lost, because the pop values are so much bigger. On the other hand, when you look at a chart like this — and now I'm not talking about the statistical implications, just the visualization implications — you have altered the scale, so it's much harder to look at this visualization and have any clue that the pop variable is a considerably higher number than all of the other variables. So it kind of depends on what you're trying to display. And there are, at least from a visualization point of view, ways to tell the visual story without normalizing the data, right? So for example, I could say facet_wrap by variable — let's just double check how this shows up. Oh, hold on, that function's not right. facet_wrap. If I do that, now the scales are not right, but let me just double check some documentation so I write down the right thing. Here we go: scales — scales equals "free". So now I've not normalized the data to itself, and the scales are all different, but the trends can be appreciated alongside their compatriot data, while you still have a sense of the actual unit value of the data in each individual chart. So there are tons of trade-offs visually. There may be trade-offs statistically that I'm not able to express to you, because I don't know the answer, but some of the graduate students that I talked about earlier in the program — graduate students who work in the lab and staff the chat —
They might have a better answer for the statistical implications of normalizing the data.
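The faceted alternative described above can be sketched like this — un-normalized values, but each panel gets its own y scale via the scales argument to facet_wrap():

```r
library(ggplot2)

# Raw values, one panel per variable; scales = "free" lets
# each panel keep its own axis range so small series stay visible
ggplot(economics_long, aes(date, value)) +
  geom_line(aes(color = variable)) +
  facet_wrap(~ variable, scales = "free")
```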