Resume recording. I found it. It's Alt+P. Abrupt transition, but I think that's an important message to share.

Okay, so what I'm going to do first is talk a little bit about who we are, how we got here, and what we can do with R. I often like to say that this is an eat-your-own-dog-food kind of presentation, and what I mean by that is I've used R to do everything that you're going to see today: the slides, the visualizations, the coding, putting it up on GitHub. All of that has been orchestrated through R. It's often been said that R can be a universal or ultimate interface to workflows that are reproducible, and I hope to model those for you so you can see some of the things that can happen with R.

The one exception is, of course, that many of you have filled out introductory survey forms. The survey forms are part of Google, and the registration system is a third-party company called Springshare, but otherwise I'm taking in the data and using R to manipulate it. In the case of the Google Forms, you can even use R to pull the data directly from Google.

So here's what I wanted to show you on this slide. This is the kind of makeup that I typically see. I think the slides might be a day or two out of date, but generally, right?
We generally see engineering, biostats, that kind of stuff in the top group, and usually we are very much a grad-student-heavy group, although my center is available to anybody on campus. We often see folks from all walks of life on campus, and we're happy to have you.

Here's another visualization. I actually wouldn't necessarily recommend it; I don't think it's a great visualization. But the point I want to make is that it's the exact same data, using really just one or two different commands or functions out of ggplot, and it presents in a different way. It's just breaking it down by graduate student, by staff, by faculty, and you can see much the same kind of information.

I did a time series, just marking off how many people filled out the survey and at what point. This is not a real-time time series, I can't do that, but I think it's up to date to about 10 a.m. this morning. A really interesting thing that I'm trying to understand myself is that, for some reason, this introductory group tends to have a really high response rate, but in my part two I generally don't get nearly as much of a response rate. I find that really interesting, but I admit I don't have the time to design a full-on sociological study of what might be happening there. But we're going to try and cover some of this stuff today.

I always like to ask this question: where are you, and how do you fit into this group? It helps me understand what I can move a little bit faster or a little slower with. Not surprisingly, I would say more than half of you, close to two-thirds, are coding less than once per year.
So you're definitely in the right place. But there are people who are coding daily. It's possible that the people on either end of this bar, the folks who are doing it daily and the folks who have never done it, will feel frustrated either by the slowness or by too quick a pace. That's okay. What I like to remind people is that I am available for consultations. It's great if you can watch the videos or stay through this session, but if you're feeling really behind and lost, I'm sympathetic to that. I don't mean for that to happen, and I will happily meet with you. Later we'll talk about some strategies on how to do that. And if this is really too slow for you, again, one of the most useful ways to really leverage your R knowledge is to work on your very specific project. But we won't have time to work on your individual project today. We'll do that in a consultation at your convenience. I'll send some information out about that.

I generally like to ask people how much they've used some kind of statistical modeling. It's a similar kind of graph, but there's a really large group using models several times a year, so R is a natural language for that kind of thing. The two languages that we support most frequently in my center are R and Python, and quite honestly you could use either one for this kind of thing. What I would say distinguishes R is that it is generally referred to as a data-first programming language, meaning it's all about the analysis, whereas Python is effectively more of a general programming language. So if you're really into coding, you're wanting to make apps, that kind of thing, maybe Python is better. But they're both Turing-complete languages. They both do a whole bunch. There's only one thing I'm aware of that Python does that R doesn't, which has something to do with controlling microcontrollers, and honestly I don't even really know what that means. I know it's something.
It's something I'm never going to do; I'm much more into the data.

The other nice distinguishing feature about R, I think, is that it has a very modern, rich, and complete sense of the full data life cycle, which means you can do so much of your work in R. And the more work you can do in R, the more likely you can generate your work as reproducible outputs, which is a goal for many.

I often ask about version control, because version control, which I would define as using Git and GitHub, although there are other aspects, is of foundational importance to reproducibility. It's fine that you haven't used these so much. I do have a workshop on version control and I'm happy to point you to that.

I often ask about shell or command-line interfaces. Similar stuff: not a lot of expertise in this. But an important comment to make is that in a lot of ways Python and R, as programming languages, are effectively command-line interfaces. In other words, they're not point-and-click so much. One of the values of that, as the world evolves into a more sophisticated understanding of computational thinking and computational workflow, is that point-and-click interfaces are often not very reproducible, whereas command-line interfaces can be strung together in a linear way. So those two things kind of go hand in hand, version control and shell.

And then I'll usually ask about databases. Not a lot of people are using databases, but I like to make the comment, especially for those of you who are using databases regularly, that R has a very nice package called dbplyr. dbplyr and a couple of other packages enable you to interface with and query databases directly. One of the nice things about dbplyr (we're going to talk about the tidyverse and dplyr) is what happens once you learn the dplyr verbs.
You can actually use those same verbs to query the databases, and you don't actually have to learn SQL, which may be a good thing. I think it probably is if you don't want to be a database engineer but you just want to deal with the data coming out of a database. It works great if you have a database administrator managing the database and you can just deal with the data.

Okay, so this is really the main reason why I ask the survey. I like to get an idea of where people are in their R journey. Surprising to me this time, but it's great, is that at least half the folks said, yeah, I'm good with importing data. So that's really great. It's a little more mixed on editing scripts. A huge part of today's discussion will be about subsetting data, and it seems that that's going to hit the right note. We're going to talk about projects and reproducibility and importing data and editing scripts, and they all kind of lend themselves to best practices. What can you do to make your scripts reproducible and easily used, and not only by others? Quite honestly, the target of reproducibility is actually yourself, because within six months you're going to forget all the things you wrote down, and it's just helpful to have these good practices.

Okay, I like to ask about data management workflow. Reproducibility is one aspect of that, but I also like to ask because R will help you with your data management, and also because in my center we have two data management experts who will help you. Data management is becoming more and more important in the academy. A lot of times we can help you with the data management tools.
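To make the point about dplyr verbs and databases concrete, here is a minimal sketch that is not part of the workshop materials. It assumes the DBI, RSQLite, dplyr, and dbplyr packages are installed, and it uses an in-memory SQLite database loaded with R's built-in cars data as a stand-in for a real database that an administrator would manage.

```r
# Check that the packages this sketch relies on are available.
needed <- c("DBI", "RSQLite", "dplyr", "dbplyr")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have_all) {
  library(dplyr)

  # Stand-in database: an in-memory SQLite file holding the built-in cars data.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "cars", cars)

  # Ordinary dplyr verbs, pointed at a database table instead of a data frame.
  fast_cars <- tbl(con, "cars") %>%
    filter(speed > 20) %>%
    arrange(desc(dist))

  # dbplyr translates the verbs to SQL; show_query() prints that SQL.
  show_query(fast_cars)

  # collect() runs the query and brings the rows into R as a data frame.
  result <- collect(fast_cars)
  print(result)

  DBI::dbDisconnect(con)
}
```

Nothing is pulled into R until collect() is called, which is the design choice that lets the database do the filtering and sorting.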
So you're more likely to get the grant applications written in a way that promotes data management. And we can also help you at the end of that project process with ingesting your data outputs into the proper repositories, including the general repositories that the university manages.

This next slide is really just another one of those check-out-how-cool-R-is slides. This is the exact same code with one switch, just using a ggplot theme, and it's just nice that I didn't have to think about all the colors. I just used one other theme. So I'm just trying to show off some of the things R can do.

Now, R, RStudio, tidyverse: they're all related. R is the programming language. RStudio is an application that sits on top of R that makes it easier to program in R. And the tidyverse is a collection of packages that work well together and that kind of bring R, a statistical language, into more of a data science, data-first context.

One of the things that the RStudio and tidyverse community is well known for is being helpful. You can Google this anytime you want: Google the phrase RStudio Community and you'll find an online board full of people who know a lot more about R than I do, who will help you when you're like, I just don't know how to iterate through my data frame, or I don't know how to do this or that. But to bring that a little more locally, last semester I started a Microsoft Team called Rfun, because Rfun is my sort of branded series of workshops on R, and I put it into the university's Slack-like tool.
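As an illustration of the "same code with one switch" idea from the theme slide, here is a small sketch. It is not the code behind the slide: the mtcars data, the bar chart, and the choice of theme_minimal() are all assumptions made for the example, and it needs the ggplot2 package.

```r
have_ggplot <- requireNamespace("ggplot2", quietly = TRUE)

if (have_ggplot) {
  library(ggplot2)

  # One plot definition, built from a data set that ships with R.
  p <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
    geom_bar() +
    labs(x = "Cylinders", fill = "Gears")

  # The "one switch": add a complete theme and leave everything else alone.
  p_themed <- p + theme_minimal()

  print(p_themed)
}
```

Swapping theme_minimal() for another complete theme, such as theme_dark(), changes the look again without touching the data or the geoms.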
It's called Microsoft Teams. Many of you will know what Microsoft Teams is. You can get to Microsoft Teams through Outlook. If you're in Outlook on the web, you can click on that little Teams button. Then you can click the join or create button, and then there is the code. You can take a picture of this if you like, or you should be able to click on that link and join that way. As far as I know, and this is the way I tried to set it up, this is specific only to Duke. You have to have a NetID in order to get into this team.

Again, I started this last semester. It got a little bit of traffic. My goal is to recreate a local, helpful community where people can ask questions and get questions answered, or maybe you know the answer. Everybody knows something about R. I don't mean to be the university's only expert on this; indeed, I am not. But it is my job to be helpful, so this is my approach to trying to cultivate that community. So feel free to join the Microsoft Teams Rfun site and post your questions there, because of course some of you might be working at midnight, and I assure you I am asleep, or ideally asleep, by midnight. So I'm not going to answer your questions at midnight, but you might not want to wait.

Feel free to take a picture of this. I think I've sent you a lot of these links already. But you can schedule me for consultations. All of my colleagues are available for consultation.
So you can email us at askdata. I've got Drew on the line. Drew, as a GIS specialist, helps with mapping. Drew's colleague, our mutual colleague Mark Thomas, is a mapping expert. Our colleague Eric Monson specializes in visualization and knows a lot about Python and Tableau and some MATLAB. I mentioned our two specialists in data management; that's Sophia and Jen. And then we also have two econ grad students who are in computational economics, who happen to know a great deal about statistical models, much, much more than me.

The econ students only interface with people through a chat interface. All the rest of us will make appointments, but they only get paid for x number of hours a week. So you can go to our site and connect with any of us. I encourage you to try and connect initially through the email, but the chat service is available all the time. The reason why I say the email address is, I mean, you're welcome to send me email directly, I'm happy to get it, but you don't know if I'm on vacation or whatnot, whereas the askdata email will be more of a triage to get it started and make sure your question goes to the right person.

So, the resources for today. We will get somewhere. We're going to focus on the exercises after I get to the point of asking for your questions. But these are all useful links. The Rfun flipped code base is the code base for today's workshop, which you may have already looked at. The exercises will be supplementary to that. The Rfun site is just a branded site of just the R workshops, and then you can find out everything else about our services at this link.

One more comment, and I'm almost done. I'm almost done with the introduction.
One more comment about getting your questions answered. The R community is known for being helpful, and as such they have sort of introduced, I don't know that I would call it a standard or even a convention just yet, but it seems to be gaining traction, this concept of reprex, or reproducible examples in code. What I want to promote to you is that this is the most efficient way to get help on your question. You can go to that link and learn more about reprex. There's even a package that will help you turn your question into a reprex.

But in short, let me put it to you this way. Nobody wants to receive 500 lines of code and have someone say, somewhere in there I have a problem, or even, you know, in the third subroutine, I'm not sure why it's not working. So the concept of reprex is: reduce your question to its smallest, most reproducible element, and then include only the code needed to run it and the smallest bit of data that will reproduce the problem. I will caution you, and I'm sure you all know this, that if you're posting questions to internet sites and you have personally identifiable information in your data sample, of course you want to scrub that out. But the idea is to reduce this all to a very small, easily reproducible problem so that you can get a solution. I guarantee you that you get much quicker answers on the internet when you do this. It's a great technique.

I am also sympathetic that if you are brand new to R, you may not really have the tools to create that reprex.
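For anyone wondering what a reduced question looks like in practice, here is a small sketch of a reprex-style example. The data frame and the question are made up for illustration; the point is that the whole thing is a few lines, carries its own tiny data, and uses nothing but base R.

```r
# Smallest data that still shows the problem (made-up values, no personal data).
survey <- data.frame(
  role  = c("student", "staff", "student"),
  hours = c(2, NA, 6)
)

# The surprising result being asked about: the mean comes back as NA.
mean(survey$hours)

# What the question was expecting to see instead.
mean(survey$hours, na.rm = TRUE)

# With the reprex package installed, copying the lines above and running
# reprex::reprex() renders the code together with its output, ready to paste
# into a forum post.
```

Anyone can paste those lines into a fresh R session and see the same behavior, which is what makes the question quick to answer.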
So I'm happy to get any kinds of questions, and I will probably ask you to do things like make it reproducible. But feel free to say, hey, I'm just brand new here, I don't know what I'm doing, and I'm a little bit lost. I'm very sympathetic to being new to programming languages and feeling a little bit lost. So help me help you, and we'll figure out an answer to your question.

All right, this is the end of my slide deck. Just so you know, the slide deck is actually available. I'm going to go back three slides. If you downloaded this code base, the slides are in there. They're buried in there, but they're in there.

And this is just one more example of what R can do. This is pulling your survey responses. There's a left side and a right side to this survey. The left side is the pre-survey responses, and the right side is false, because there have not yet been any post-survey responses, so I just basically doubled up your answers and put them both on the same side. What you can see across each row is the different aspects that I'm going to cover today. My question to you in the pre-survey is how comfortable you feel about these different aspects. So in terms of importing data, the majority of you agree that you feel comfortable with that. And what I'm hoping to see in the post-survey, so I can judge my effectiveness, is that there will be less brown and more blue, and that I'll be moving all of you over to that blue side. So that's just a little plea to say I hope you did fill out the pre-survey and I hope you'll fill out the post-survey. I was delighted that I got a 50 percent response rate. Thank you so much for all that you have gone through just to be here today. I know it feels like a lot.

That said, let me comment on what we're going to be doing going forward.
I'd like to open up for your questions, but I want to start off with a quick reintroduction to how you can use R in a reproducible manner by creating projects and scripts, and we're going to do a quick data import. So it's really these four things. And if you'll bear with me, once I get that done, I want to open it up to stuff that you maybe didn't understand in the videos, if you watched them.

I need to move my screen around here a little bit. Why is this not working as well as I want? Here we go. Put that over there, and I'm going to use Zoom to connect to a virtual computer, if I can find it. Where is it? It's not there. Where did my virtual computer get to? Did I turn it off? No. Ah, it's right there. Oh, I've got more windows than I realized.

Okay, so this is my virtual computer. I hope you are seeing at this moment just a blank, blue, kind of baseline Windows screen. If you are, then we're in the right place. If you're not, maybe throw something in the chat so that I can correct that. Looks good. All right, great.

So like I said, I'm going to try and cover these basic aspects of reproducibility: scripts, projects, and import. I'm going to start off by clicking on RStudio again.
RStudio is a mask that sits on top of R. For this workshop, I downloaded the very latest version of RStudio, and it has a really nice new feature in it that I would like to mention, and I'll point it out to you. But in case you haven't seen R before, this is RStudio sitting as an interface on top of R. On the left-hand side is the thing that we call the console, which is really the direct interface to the R kernel. That makes it really just a big calculator. You can put in any kind of math and get a response.

Or you can do basic R activities. Let's say I'm going to create a vector of character names. I'll create a vector called names, and then I'll use the conventional R assignment operator, which is a less-than and a dash symbol, and then I'll use the concatenate function. I think that's concatenate. I forget what c stands for, actually, but let's say it stands for concatenate. I'll use the concatenate function and I'll add my name, and I'm going to pick out some names that I see: Hiro, and, I'm just going to do three, Heather. If I do that and hit enter, nothing seems to appear here, and that's because I have the assignment operator. But up here in my environment tab I see that I have a new object, names, and it's a character vector of three elements, and it's those three elements. If I want to see that, I can just type names, and when I hit enter I'll get the value of that particular object.

Okay. The thing is, if you string together a whole bunch of commands like that, it's not very reproducible, and the minute your data changes you have to go back and remember all the steps you did. That's why in R we use scripts. The easiest way to get started with a script is to click on this little green cross up in the upper left. If you click on that you'll get a mini context menu, and you can create lots of different kinds of scripts.
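For reference, the console steps described above look roughly like this when typed out. This is a reconstruction from the narration, not a copy of the screen. On the question of what c stands for: R's own help page for c() is titled "Combine Values into a Vector or List".

```r
# The console as a calculator: type math, get a response.
2 + 2

# Create a character vector with the assignment operator (<-) and c().
# Assigning prints nothing; the object shows up in the Environment tab instead.
names <- c("John", "Hiro", "Heather")

# Typing the object's name prints its value.
names

# It is a character vector with three elements.
class(names)
length(names)
```

Calling the object names works, as in the demo, although it shares its name with the base function names(); a more distinctive name avoids confusion in longer scripts.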
I'm going to recommend one way. I'm not going to say that there's one right way, but I will tell you that I'm very bullish on tidyverse approaches. So I'm going to create this thing called an R Notebook. What an R Notebook will do is generate this blank code in the editor portion of my RStudio, so it just split that, and what I'm probably going to do is minimize the console, because I don't personally want to keep on working in the console.

All right, and what I see here is an example of something called literate coding. What literate coding means is that you can combine prose, or natural language, with code. The code exists in these little blocks here. This is one code block, or code chunk, and there's prose on either side. What you can do with that is narrate and write your reports in the prose part, and then intersperse whatever is needed to generate visualizations and analysis in the code chunks. You can have as many code chunks and as many prose chunks as you want.

Up at the very top is just some basic metadata: a title for this document. Then I can add more basic metadata. You can learn more about this, but I'm just going to put in my name, John Little, in quotes. I usually put in a date. It's easy enough to just put in a free-text, natural-language date, but you can also put in backticks and put an inline R function there, Sys.Date().

Now, I know that this might be going a little bit fast. Remember, this will all be recorded, in case it's going too fast.
Don't fret. What that would do is pull the system date.

Just as I'm saying that, I'm noticing all of a sudden that I feel all the sunlight coming into my room, so I want to apologize if my face is too bright or bothering you. I'm working from home, and I repainted the room that I'm working in. I took down the blinds and the new blinds haven't come in, so I have no control over the light. I may end up leaning forward so I can see my own screen well enough.

Okay, so this is the metadata portion. It's called a YAML header, and you'll see how it works in a minute. Everything else is Markdown, and it will feel like old-school, 1970s-style markup on your prose. For example, this is a link, and this right here is surrounded by single asterisks, which makes that word italic. All you need to know about Markdown you can find out from this link right here. In RStudio, you could actually hold down the shift key... Oh, well. Great. It's not working for me, but I'm sure I just have to reboot. This rarely happens, but it did seem to happen today, which is typical of a live demo. If you hold down the shift key, you can usually click on that link and it'll pop you into a browser to find out more about R Markdown.

But there is an easier way. You can just go up to Help and choose Markdown Quick Reference, and in the help box you'll see a whole bunch of information, like how you make words italicized: you can either wrap them in single asterisks or in single underscores. Bold is the same way: double asterisks, double underscores. You can make first-, second-, third-, fourth-, fifth-, and sixth-level headers. You can make bulleted lists and ordered lists. There are examples here about how you use R Markdown.
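To show how the pieces described so far fit together (YAML header, inline Sys.Date(), Markdown prose, and a code chunk), here is a sketch that writes a small notebook file from R. The file name and the exact wording of the prose are assumptions for illustration; the three-backtick chunk fences are built with strrep() only so they do not clash with the formatting of this example.

```r
fence <- strrep("`", 3)  # the three backticks that open and close a code chunk

notebook_lines <- c(
  "---",                                   # YAML header: document metadata
  'title: "Hello World"',
  'author: "John Little"',
  'date: "`r Sys.Date()`"',                # inline R code pulls the system date
  "output: html_notebook",
  "---",
  "",
  "## Executive Summary",                  # second-level header
  "",
  "Prose with *italic*, **bold**, and a [link](http://rmarkdown.rstudio.com).",
  "",
  paste0(fence, "{r}"),                    # start of a code chunk
  "plot(cars)",
  fence                                    # end of the code chunk
)

# Hypothetical file name, written to a temporary folder for the demonstration.
notebook_path <- file.path(tempdir(), "hello-world.Rmd")
writeLines(notebook_lines, notebook_path)

# Print the file to see the notebook source as it would appear in the editor.
cat(readLines(notebook_path), sep = "\n")
```

In day-to-day work the file is typed directly in the RStudio editor; writing it from a script is just a convenient way to show the whole layout at once.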
It's all right there. Normally when I'm working I just delete all this stuff at the start, but I'm not suggesting that you do that if you're a newbie. This is all useful information designed to help you. Soon you will have it completely memorized, because it's not really that hard to memorize, but you don't have to memorize it, because it always comes up. You can leave this here until you're done with your report.

But what I'm doing is composing a report. So I might start out with a second-level header, Executive Summary, and I might compose something to my audience saying... Sorry, I can't talk and type at the same time. But there's an example of a second-level header and some text. Then, oftentimes, if it's a technical audience, I'm going to do something like this: Import, sorry, Load Library Packages. And I'll add another code chunk. I can insert a code chunk right here by clicking on this green button, but just note again, like I said, all the information needed is right there.
This tells you how to insert a code chunk. So I can do that and insert a code chunk right there. We'll do more of this, but I can type library(tidyverse), and you'll see that there's tab completion. The less information I give it, the more options it gives me. So I can go down here to tidyverse and hit tab, and that would load my library. The other thing I can do is execute these code chunks by clicking on the little green arrow, or I can click the Run option and Run All.

Now, why this leads towards reproducibility is a couple of reasons. One, like I said, you have all of your code right there, and you have your analysis and your prose there. Your output in this case would be an HTML notebook, which is a document you can share with somebody, and they don't have to have R. They can see the whole thing. They can see either a polished report or a technical report, whatever. You can change that output setting to slides, to dashboards, to Shiny dashboards, to websites, to e-books, to EPUBs, to articles. I don't know what I'm forgetting, but the list goes on. The point is you can generate all that from one bit of code, and that lends itself to reproducibility. So that's something called literate coding.

And an R Notebook, in this case, if I were going to change it, I could change to a PDF file. I could change it to a Word file. Just so you know, I rarely ever open up Microsoft Word myself anymore. I just do everything here.

You might say, hey, but I don't really want to write in that old 1970s computer-style manner. What I want to point out to you is that there's this new little icon in the latest version of RStudio. It says switch to visual editor. So you could do that. I'm going to do it right now, and it's going to switch over to something a little more modern looking that you're probably more familiar with. So I can use this, and instead of wrapping packages in asterisks to make it bold, I can just highlight it and
make it bold, and I can highlight this and make it italic, and I can highlight this and make it a link. It works just like many of you who are used to more modern editors would come to expect and appreciate.

I'm going to switch back, because I find that I am still in a space where I prefer the other method, but I want you to know that that's there. So I'm just going to click on this again, if I can. Come on. There we go. Oh, I clicked, I did it too quickly. I was impatient.

Notice what it did. Where I made the word packages bold, it wrapped that in double asterisks. I made RStudio italicized, so it wrapped that in single asterisks. And I made the word analysis a link, and it wrapped that in the R Markdown syntax for making links.

So I said that you have prose and you have code chunks. I'm going to execute this code chunk by clicking on this green arrow. And by the way, soon I will make my font larger if it's too small to see, but I would like you to be able to see the full features of RStudio for now. When I execute just this tidyverse package, which is a suite of modern data science packages, what it's really doing is loading eight packages at once.

Now, there are two things about packages. Sometimes you install them, and there's a command for installing them, but I will just point out over here in the bottom-right quadrant, where it says Files, Plots, Packages, that I can install from here. I can click Install right here and type in gapminder. I'm not going to, but I could install gapminder right now by doing that. You only have to install a package once on a machine. You may have to update it, but you only have to install it once. On the other hand, this is called loading a package, where you type library.
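The install-once, load-every-time distinction can be sketched in code as follows. The Install button in the Packages pane is the point-and-click route to the same install.packages() call. The check-before-loading pattern here is an illustrative choice, not something shown in the demo.

```r
pkg <- "gapminder"

# requireNamespace() reports whether the package is installed on this machine
# without attaching it to the session.
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Installing is a one-time step per machine, so it is left as a message
  # rather than something the script does on every run.
  message("Run install.packages(\"", pkg, "\") once, then rerun this script.")
} else {
  # Loading is needed in every new R session or script that uses the package.
  library(pkg, character.only = TRUE)
}
```

In a script you would normally just write library(gapminder) near the top; the character.only argument is only needed because the package name is stored in a variable here.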
So if I had gapminder installed, I would also load that. Looks like I do have it installed. So you only have to install packages once, but you have to load them every time you open up RStudio, or every time you open up the script. So I'm going to put those at the top and run it.

When I ran that, RStudio, or R, came back and told me that the tidyverse package is actually a conglomerate of eight packages that it loaded. When you install tidyverse it actually installs something like 50 packages, which are all helpful, but most of them are foundational; they sit in the background and you never have to worry about them. But these eight packages are useful. You've got ggplot for visualization. You've got tibble, which is for data frames, keeping your data in a grid format. tidyr and dplyr are for manipulating and wrangling your data. readr is good for importing CSV data. purrr is good for iteration, so if you want to iterate over a data frame or a list. stringr is very handy for using regular expressions and manipulating text. And forcats is a library that's helpful in working with categorical data, right?
So, factors.

The other thing it's telling me is that there are some conflicts, because there's a filter function that's associated with dplyr and there's a filter function that's associated with stats. I can use the long format to address either one of these, but what it's telling me is that if I just type filter, I'm going to get the dplyr filter, which is masking the stats filter. By and large you can kind of ignore this, but it's helpful to know what you're looking at. It does the same thing for lag.

Now, if I have a really technical audience, they might not mind seeing that feedback. But in the context of my report, if I don't want them to see that feedback, I'm going to click on this here and turn off my messages. And I might even choose this option where it says show output, and choose show output only, where it doesn't even show the code chunk. But I'm going to make those decisions selectively. Some audiences for my reports will want to see the code chunks and the analysis, and some audiences won't. So it makes some changes up here in this line, and I can run that again and get a different kind of output.

Here's a different example of a code chunk. It's taking the cars data set and making a scatter plot out of it. This is an example of base R visualization. By base R, I mean not tidyverse: the stuff that comes with R. So when I execute that, I get an inline visualization of a scatter plot.

All right. When I save this I might do one more thing. I might change the title of this document to Hello World and click save. Or actually, I'm not going to click save.
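The "long format" mentioned for the filter conflict is the package::function form. Here is a short sketch of both filters side by side; the numbers and the speed cutoff are arbitrary choices for illustration, and the dplyr half only runs if dplyr is installed.

```r
# stats::filter() is a time-series tool. With three equal weights it computes
# a centered three-point moving average.
x <- 1:5
moving_avg <- stats::filter(x, rep(1 / 3, 3))
print(moving_avg)

# dplyr::filter() keeps the rows of a data frame that match a condition.
if (requireNamespace("dplyr", quietly = TRUE)) {
  slow_cars <- dplyr::filter(cars, speed < 10)
  print(slow_cars)
}

# In an R Markdown chunk header, message = FALSE hides the start-up and
# conflict messages, and echo = FALSE hides the code while keeping its output.
```

Writing the package name explicitly means the code behaves the same no matter which packages happen to be loaded, which is useful when sharing scripts.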
I'm going to click save up here, and it's going to let me move this. I'm going to go from Packages back to Files down here in the lower right-hand quadrant. I'm going to click save and it's going to prompt me to give it a file name, so I'm going to call it hello world two, because it looks like I've already done hello world. And when I save that, if you look real closely down here, what you'll see is that it makes a derivative report. So there's hello world, and then it made the derivative report. I'm sorry, here's hello world two, and here's hello world two dot nb dot html. If I click on that I can open it in a browser. It's in a web browser, but you can send that complete HTML document to anybody, and as long as they have a web browser, they're going to be able to read it.

Again, if I go back to this notion that I can have different kinds of outputs: maybe I want to create a PDF document and send that to them. Maybe I want to create a Word document and send that back to them.

I'm sorry for being so late to looking at the chat, but it looks like the question is: for this workshop, are we using R Markdown or R script? We're using R Markdown, and Drew answered that, so thank you, Drew. Okay, I don't have to worry too much. And the reason why we're using R Markdown is going towards reproducibility and demonstrating literate coding. So, going back to this sample report: this is something that I could send to a non-R person, and they could read it and see the output. Again, they might not be interested in this technical aspect, but they might be interested in just the chart, so I could send them that report.

All right. I am almost done.
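The idea of one source document producing several kinds of output can also be driven from code rather than from the RStudio menus. The following is a sketch under assumptions: it needs the rmarkdown package and a working pandoc (RStudio bundles one), it uses a throwaway document in a temporary folder, and it skips PDF because that additionally requires a LaTeX installation.

```r
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

# A tiny stand-in document; a real report would be the saved .Rmd file.
fence <- strrep("`", 3)
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "Hello World"',
  "---",
  "",
  "Some prose for a non-R reader.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd)

if (can_render) {
  # Same source, two outputs; render() returns the path of the file it wrote.
  html_file <- rmarkdown::render(rmd, output_format = "html_document", quiet = TRUE)
  word_file <- rmarkdown::render(rmd, output_format = "word_document", quiet = TRUE)
  print(basename(c(html_file, word_file)))
}
```

Swapping the output_format string for "pdf_document" is the corresponding route to a PDF once LaTeX is available.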
I want to talk about projects here real quickly. A lot of old-school R people are in the habit of using two commands a lot, and they'll put them up at the top. They'll put setwd, and here they'll put in an idiosyncratic file path to something. And if you're not old school, if you're learning, don't memorize this. My comment here is: don't do this. And they'll also do this rm command, something like this. I do this so infrequently that I've typed something that's not exactly correct, but the idea behind this command at line 10 is that you're cleaning out your environment, and the idea behind the command at line 9 is that you're setting the working directory for your project. The problem with both of these is that they're not as effective as you want them to be in a reproducible context. For example, setwd means that if you move to any other computer, and you will upgrade your computer someday, it's unlikely that your new computer will have the exact same file system as your old computer, which means all of your scripts, minimally, have to be updated right there. If you use RStudio projects, you'll never have to do that, and you can refer to data through a relative file structure, which is the preferred reproducible approach. Similarly, they're using this to clean out the environment because they want to start from scratch every time. What I'm going to do is comment out each one of these so they don't run. In an RStudio project environment, what you would do is just go to Run All, and you could always do this one first: you could just restart. In this case I'm going to clear the output so you can see how this returns it back to normal, and then you can go back to Run All and it'll run the whole thing. And what you're certain of is that if you can clear your output, restart, and Run All, you have a reproducible script, assuming you've used relative file paths. You can share that
with somebody else on GitHub. You can share that with your boss, your colleague, your co-worker. You can share it with yourself, which is something I do all the time: I have one computer at work and a different computer at home. So that's why you want to use projects. The easiest way to use projects is to go up here on the upper right where it says Project and click New Project. All right, this is asking me to save something, just the latest version of this unsaved document, so I'm going to click save. And it's going to give me a dialog box here in a second. You can see you can choose different ones. If you're using version control, you can pull projects directly from GitHub. I'm just going to do a blank new project. You can also start bookdown projects, which generate ebooks, or start websites, different things. I'm just going to do a blank project. By default, for me on Windows, it's going to put it in my Documents directory. So I'm going to call it dinosaurs, assuming that I have some project where I'm evaluating dinosaurs. And what it does is create a different view of RStudio. You can have multiple RStudio sessions running at one time. They won't bleed over into each other; they won't share the environment. So you're not going to have any danger of accidentally having the same variable name, that kind of thing. And then it starts with only this one file, which is the project file. So I'm going to close this real quickly, if you'll keep up with me. I should not have closed that, and I hope to be able to restore all that. Yeah. If I look in my file system under Documents, here's my dinos project. It has my .Rhistory and this one other file, the .Rproj file, which is really tiny, and all it does if you double-click on it is launch you right into that RStudio project.
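To make the contrast concrete, here is a minimal sketch of the two habits discussed above next to the project-relative alternative. The absolute path, the `data` folder, and the `dinosaurs.csv` file name are all hypothetical and only there for illustration; the two old-school lines are left commented out so nothing is changed when the block runs.

```r
# Old-school preamble, shown only so it can be recognised (not recommended):
# setwd("C:/Users/someone/Documents/dinosaurs")  # hypothetical absolute path; breaks on another machine
# rm(list = ls())                                # removes objects only; attached packages and options stay

# Project-style alternative: build paths relative to the project root.
# "data" and "dinosaurs.csv" are hypothetical names.
dino_path <- file.path("data", "dinosaurs.csv")
print(dino_path)

# When a project is opened from its .Rproj file, the working directory
# is the folder that holds that file, so the relative path resolves from there
print(getwd())
```

Because nothing in the script mentions a machine-specific location, the same file should run unchanged on a work computer, a home computer, or a colleague's clone of the repository.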
All right, so in a minute I'm going to cover one more aspect of importing data, but I want to pause here because I've talked kind of fast. I want to make sure we all have the same introduction to how this environment can work for you, and I want to give you all a chance to ask questions. I'm happy to go straight into questions: general questions, ideally, about things that were unclear in the videos that I sent out, or where you ran across some of the exercises and something didn't seem to work right. I would love to pause. Just unmute yourself and start asking your question, ideally, or if you want to, you can put it into chat. Let's see if we have any questions right off the bat, and that can include the stuff that I just covered. I'm going to join about the... Oops, sorry to interrupt. Can I go ahead? Oh, please do. Sorry. Thank you. Yeah, it's about the first practice exercise, specifically about using the left_join function. Okay. So when I try to run the left_join function, I'm getting the error that join columns must be present in the data, and there's an error in calling the data by name. So I'm wondering: is that an error in the way I'm running the function, or does it have something to do with the way the favorability CSV file was set up? You know, maybe I'm calling it wrong or something like that. Yeah. Right, I'm with you. Just to make sure I understand: are you in this file, exercise_01? We're in a different... Oh, this is a separate notebook I set up from the part one video. Oh, I see. Yeah, okay.
Let me see. Let me just open up a couple of files and make sure I have some good examples. I don't quite know the answer to your question, because I don't understand... Here's one thing that's really true about R: it has some horrible error messages. A lot of times you just stare at them and go, I literally have no idea what that's supposed to be telling me. And so what I'm looking for is my example of left joins, and I know I have one. So I'm going to go down here to Find in Files, type left_join, and see what RStudio comes back and tells me. Oh good, there's one in quick start, but I don't know quick start. I'm surprised that... No, I think I know where I want to be. I'm going to switch to a different project. I think this is the one I want. Yeah, let's try that. Double-click on that. Could you tell me, Ellery, could you tell me the error message again? Yeah, so the context of it is we're using some functions from dplyr, and we are trying to join the favorability data onto the starwars data. So the line specifically is calling the left_join function to join favorability by name, and the error message I'm getting says join columns must be present in data. So, yeah, I was thinking maybe it's the favorability data, because that was the one that we referenced by the URL to the GitHub repository. It sounds to me like you have two data frames.
I really wish that I could find a good example, and I'm embarrassed that I don't have an example handy, but I'm going to find one. I'm going to go here, and I bet it's going to be able to find it there. It sounds to me like... In order for a join to work, you have to have a key that is the same variable in each data frame. Okay? And in order for R to figure it out automatically, that variable has to have the same name in each data frame. Now, you can manually override that, and you can say, you know, I want to join hair color with weight, and so anywhere where hair color equals blonde, if weight equals blonde (which I know doesn't make any sense), if it can make that connection, it will then join the data. However, without seeing the full thread of what you did, since you made your own example in your own editor, it would probably be... There we go. There we go. That's exactly what I want: 02 join skimr. So I'm going to just demonstrate it here. In this example I'm going to load two libraries, tidyverse and skimr. We're not going to get to skimr. The first thing I'm going to do is read in some data, and in this case I'm skipping 11 lines of data, because the first 11 lines are the provenance of the data set, which actually came from FiveThirtyEight.com. I don't think that we're seeing your screen. Oh, thank you so much, whoever said that. Thank you. Let me get back here. All right, when I click share you should be seeing a dark blue screen. So I ran these two libraries at the top, and then I'm going to read in this data using the read_csv function. And that's an example of referring to the data relatively, because since I'm in an RStudio project, all I had to do was refer to the directory. So I'm going to expand my screen. What I mean by that is relative data file naming.
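The point about join keys can be shown with two tiny made-up tables. This is a sketch only, assuming dplyr is installed: the data frames, the ratings, and the column name `fav_rating` are invented for illustration and are not the workshop's files. The first join relies on both tables having a `name` column; the second shows the manual override when the key has a different name in each table.

```r
suppressPackageStartupMessages(library(dplyr))

# Two small, made-up tables that share a key column called "name"
characters <- data.frame(
  name    = c("Luke Skywalker", "Yoda", "Greedo"),
  species = c("Human", "Unknown", "Rodian")
)
favorability <- data.frame(
  name       = c("Luke Skywalker", "Yoda"),
  fav_rating = c(0.9, 0.8)  # invented numbers
)

# Same key name in both tables: name it once
joined <- left_join(characters, favorability, by = "name")

# Different key names: spell out which column matches which
ratings_alt <- rename(favorability, character = name)
joined_alt  <- left_join(characters, ratings_alt, by = c("name" = "character"))

joined
```

Rows without a match (Greedo here) are kept and filled with `NA`. If the column named in `by` is missing from either table, for instance because the file was read in with the wrong header row, `left_join()` stops with a "join columns must be present in data" style error like the one described in the question.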
This is the data directory, and in the data directory is the 538 file that I want to read in. So by using that relative path designation, it becomes more reproducible. And then I'm giving it one argument (let me minimize this), the skip argument, which is skipping the first 11 lines, because if we looked at this data raw, which I think I can do with View File, you can see that the first 11 lines are just telling me where the data is coming from. So I want to skip all that stuff. And when I run that, I'll then get an object in my environment that says favorability popularity rating. Wait a minute, why is it not in my environment? Oh, because I wasn't on the right tab. And I know that this is now a data frame that has 14 observations in two variables. If I click on this icon right here, I can actually get a data viewer that will allow me to preview it. But I personally generally don't do that. If I really want to see the data, I'll just use this same object name again below it and display it below. So there's my favorability rating data; it has two variables, name and fav_rating. Now I'm also going to use this starwars data, which I didn't have to import because it's on board. It comes with the tidyverse; specifically, it comes with dplyr. And the starwars data is 87 rows of information about Star Wars characters. And so in this join my join key is name, and you'll see that my variables have the exact same spelling there. That leaves nothing to chance. And what that means is that if Luke Skywalker in the name variable from starwars matches Luke Skywalker in the name variable from favorability rating, then it's going to pull over the value of this column, or all the other columns; it depends on how I make that work. So let's go back down here to where the join happened, right here. So the way I read this, if I read this from left to right (and we haven't gotten through all of this, so some
of you bear with us), I've got a new object name, sw_joined, gets value from (that's the way I read that in my head, and that's the assignment operator) starwars, the on-board starwars data set. This is called a pipe, which means I can read that in my head as "and then". So starwars, and then do a left join with this other data object, which happens to be a data frame. And in this case I don't have to name both columns that are the key, because they both have the same name. So if I run just that part of the command, which I can do by highlighting it and pressing Control-Enter, I now have one additional column. Instead of the 14 columns in starwars, I now have a 15-column combined data frame, and I can scroll over to the right, and you can see a couple of those joins were successful. And for the ones that weren't successful, I got NAs: not applicable, not available. There are a few more tidyverse commands going on here. But what I would say to you, Ellery (give me a second here and I'll show you how I would solve this problem), is I would hit F1. I would first figure out what the join columns are between my two tables. Do they have the exact same spelling? If they don't have the exact same spelling, I'll either make them the exact same spelling by doing something like starwars, rename, let's say rename type equals species, something like this. And that's going to rename the species variable to something called type. We can see that in action right here if I just display this. Oh, I put it at the end. Or it didn't. Where did it put it? Why am I not seeing it? There it is, it put it right there. So I'm either going to rename it or I'm going to explicitly call out those two columns. So I'm going to highlight left_join, which is my function, and I'm going to press F1, because that's going to make it easy to find my on-board help. And then I am going to look for the answer for how I want to write that
in the examples. And actually it doesn't appear like there's a good example, but there might be an example down here. I'm pretty certain I know how to do it, but I'm just wondering if I can... By the way, this is very typical of how you use R. You will get exceedingly adept at typing "how do I join tables with the left_join function" and then using the phrase "in R" or "with tidyverse", and then you'll get back a whole lot of useful information. But it would look something like this. I'm not seeing it right off the bat. I'm probably not reading... Oh, here's something. Here it is, right here. Yeah, so what that's saying is that in one table the join key, name, is going to be matched in a different table with the join key artist, and they're going to have the same values. Okay, so that's probably what's going on there. Yeah, I took a look at it. Sorry, I think the problem was that the favorability CSV was maybe read wrong from the GitHub repository. There's no name column in it. So that totally makes sense, because it can't find the name. So I have to go back in and make sure that it's reading that file correctly.
I think. Yeah, you might need to employ the skip argument, just because of the way that file exists. And by the way, I am more than happy to either, if you're staying around till the end, drill down on that specifically, or we can do a consultation and cover that more at some other time. It's important to be able to join data frames together. Okay, so John, before we move on too much: a couple of folks had questions about whether you could go over the difference between R Notebooks, R Markdown, R scripts, and R projects, just for people who maybe are only used to using the R script. Okay, so let me see if I can go back to my other cloud computer. So R Markdown is... let me explain it this way. For people who are used to looking at web pages: you see pretty web pages, but if you ever view source on a web page, you see a whole bunch of stuff that is not so pretty. So if I go to Wikipedia (I'm trying to find a place to click), right-click, View Page Source: this is the stuff that makes it possible for that page to be rendered as a pretty page, and it's called HTML, or Hypertext Markup Language. So, markup language: there are all kinds of markup languages. There's SGML, there's XML, there's HTML, and Markdown is a markup language. The reason why they call it Markdown is that it's a simplified markup language that allows you to put structure into a document so that you can have things like bolded words and italicized words. And the reason why we're doing that at all is that I want to teach you a method that is called reproducible programming with literate coding. And so literate coding is this idea that you are interspersing
prose, or natural language, with code chunks. In that way, when you have those two things interspersed, you can actually write your whole report from here and then generate the output of your report by choosing a different option right here. So if I want this to be a report to an executive, I'm going to put Executive Summary: cars are... I should probably capitalize that because it's going to the executive... Cars are unsafe at some speeds. And then I'm going to say visualization, or I'm going to call it chart: here's my evidence. I'm writing a report for my executive audience, and I'm going to then do my analysis. I'm skipping the part where I import my data set, but just so you see, if I type in cars, this is another on-board data set. It's a really simplified on-board data set: two variables in a data frame, speed and the stopping distance. So I want to show my executive this chart, but I don't want to show my executive the code for that chart. I'm going to click "Show output only" and apply that. And then I want to make that a Microsoft Word document, because my executive doesn't like all the other fancy stuff I'm doing. So that gives me a chance to rename it. I'm going to call it cars report and click save, and it should put it right there. And actually it didn't do what I intended. It gave me the option to create two different kinds of reports and it created one. So I'm going to go back up here and choose Knit to Word. And now I've generated two different reports. I'm going to minimize this for just a second.
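The "Show output only" choice made through the menu corresponds to a knitr chunk option. As a hedged sketch, assuming the knitr package is installed: a setup chunk near the top of a report could set the same behaviour for every chunk at once, which may be convenient when the whole document is going to a non-technical reader. Individual chunks can still override these defaults in their own headers.

```r
library(knitr)

# Document-wide defaults, typically placed in a setup chunk at the top of the .Rmd:
opts_chunk$set(
  echo    = FALSE,  # hide the code, keep its output (the "Show output only" behaviour)
  message = FALSE,  # hide package start-up messages such as masking notices
  warning = FALSE   # hide warnings in the rendered report
)

# Any later chunk then shows only its result, for example this base R scatter plot
plot(cars)
```

This keeps the decision about what the audience sees in one place, so switching between a technical and an executive version is a matter of changing `echo` back to `TRUE`.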
You can see that I have two different rendered reports from a single script: I have an HTML notebook report and I have a Word document. And then I can send this Word document to my executive. That's the simplest use case I can present to you. There are much more complicated use cases. But the idea that my prose and my analysis are all in one single file means that when my data changes, I don't have to rewrite the whole report. I don't have to output my visualization into some different visualization editor, then copy it from the visualization editor, and then paste it into Microsoft Word. All of those extra steps of clicking and pasting and copying break down the reproducibility chain, because you can't document "click here". Well, quite honestly, you can document it, and you can script in a visualization context, but the state of the art really has come to accept that reproducibility works better in a command-line context. So that's why we do these kinds of things; that's why I'm promoting it. I don't want to say that this is the only way to use R. If you really prefer R scripts, and all you want is the R script with your code, go ahead. What I will tell you... and for those of you who don't know what an R script is, I guess I'll show you one. Can I ask you a quick question before you move on? So whenever you generate that Word document, if you were to go back at a later time and then update the data somehow with more cars reports, and then you save that and click back on to this first generated document, will that automatically update if you run the whole script again? Yep. Yep. So let's do this. Let me go back here. I will do exactly what you said. I'm going to update my data, but I'm just going to update it in real time, and I'm going to do it so that my executive audience doesn't see it. I'm going to take my cars dataset. I'm going to say new cars gets value from cars.
That's my original data frame. And I need to take a quick peek here at what I'm doing because I don't have the variable name. So: gets value from cars, and then (that's this part), and then speed... oops, I need my mutate command. Mutate: speed equals speed... All right, now I'm going to make this up, if you all will forgive me, but let's say that speed is in miles per hour and I want it to be in kilometers per hour. I grew up in the U.S. and I can't do these. If anybody wants to unmute and tell me what the formula is, shout it out, but I'm just going to do speed divided by 0.8 and hope that that's sort of... I don't even know what it is. I'm so not thinking this way. But let's just run this and you can see new cars. I'm going to comment that out for a second and re-execute this command. What did I do wrong? Cars... Oh, I spelled mutate wrong. Perfect. Mutate. I myself am a horrible speller. And for some reason... I really probably need to restart this, but I'm afraid of what will happen. It's not working as smoothly as it usually does. What else did I do wrong? Mutate... could not find function pipe. Oh, that's because, oops, I forgot I haven't loaded my tidyverse library. And so I'm doing what I'm going to consider some really ugly code here, in that I'm doing it all in one code chunk, but bear with me. I loaded my tidyverse library so I could use my mutate function, because mutate is a function within dplyr, which is a package within tidyverse. And now I have this new speed, right? So effectively I have new data. I just made it up.
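For anyone following along, the mutate step can be sketched as below, assuming dplyr is installed. The demo used a made-up divisor; the sketch uses the actual conversion instead (one mile is 1.609344 kilometers, so miles per hour are multiplied, not divided). The column name `speed_kmh` is an assumption for illustration; the demo overwrote `speed` in place, while this version keeps both columns so the two units can be compared.

```r
suppressPackageStartupMessages(library(dplyr))

# 1 mile is 1.609344 km, so mph * 1.609344 gives km/h
mph_to_kmh <- 1.609344

# Start from the built-in cars data and add a converted column.
# "speed_kmh" is a name chosen for this sketch.
new_cars <- cars %>%
  mutate(speed_kmh = speed * mph_to_kmh)

head(new_cars)

# Two base R plots, one per unit, mirroring the two evidence charts in the report
plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Stopping distance")
plot(new_cars$speed_kmh, new_cars$dist, xlab = "Speed (km/h)", ylab = "Stopping distance")
```

Because the pipe and `mutate()` both come from the tidyverse, the `library()` call has to run before this chunk, which is exactly the "could not find function" error hit during the live demo.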
And I'm going to take out that command, run this, execute, and I get, you know, it looks like the same plot as I had before. Oh wait, what I want to do: let's make two plots. Plot new cars. And so what I'm going to do is make two code chunks, and I'm going to say: here's my evidence when speed is measured in mph, and evidence when speed is measured in kilometers per hour. Right, now if I go up here and click Run All, I have two charts, and the only difference is... Wait a minute, that doesn't look right. Oh, let me double-check my... Ali, I tell you, live coding: you're probably learning something. I'm going to change this mathematical formula so it's really, really clear to me. So rather than dividing by 0.8, I'm going to multiply it by 20. And that should give me... yeah, there we go. There it's really clear that I have a different unit going on here. All right. So all I did is generate some new data and a new graph, and then when I knit to Word again, it's going to re-render that, and now I have my report without any of the technical... Why did that look that way? I'm going to do it one more time. Oh, I forgot to turn off my library warnings. But I have my two reports. So I hope that answered your question. That's the idea of a reproducible workflow chain. This is great. I'm glad you guys are asking questions. I know I'm going to have some people who want to get into the guts of it, but if somebody else has a question or a follow-up, or I haven't explained that part well, let me know. We've been doing this for just about an hour. We've got another hour available to us. So I like to be patient just in case somebody's trying to decide if they're going to ask a question or not. Yeah, John, I hate to get too far into the weeds, but from your perspective.
So if you're working in, say, tidyverse, and you also want to do some stuff with another library called data.table, do you want to keep it all in the same idiom, or is it okay to mix and match different libraries? It's totally okay to mix and match different libraries, and that is the nature of that comment I was making before. I don't know if I can reproduce it, but when I loaded the tidyverse library, down at the bottom of the feedback it said that dplyr filter was masking stats filter, right? So the whole point of having libraries is extending your R into various realms. The tidyverse is a very generalizable realm, but you may have an expertise in some kind of modeling, and so there's another package you want to bring in, and you want to run those at the same time. Now, the data.table package does some things that are similar to what tidyverse does, but you should be able to run them both side by side. You just have to keep a lookout when you load them: are there any conflicts that are preventing me from using a possibly identically named function? But even so, just because there's a conflict, it only means that it's masking the short version, or the convenient version, of the function name. So when I run tidyverse, I no longer have access to stats filter by its short name. But if I start a new code chunk right down here at line 36, I can still access stats filter just by typing the long version, which is to first identify the package, and then two colons, and then filter, and then, you know, whatever the arguments are for stats filter. So you can use them both; you just have to be more careful where there are those kinds of conflicts. Great question. All right. So I'm really glad that you guys are jumping in there. I am still open to anybody jumping in and asking another question, but what I'm going to propose, while waiting for that other question to come in, is that when I move forward I will do
one more quick demo on how you can leverage that relative path and project concept when you're importing data, and show you some tips about importing not just CSV files but Excel, Stata, and SAS. There's a really nice import wizard that's worth knowing about. Excuse me. And then if there are questions there, we'll deal with those, and then I'm going to move into some more specific questions about the dplyr package, that sub-package of tidyverse, which is all about data wrangling. So how do you subset rows and columns and create new variables, like I did earlier with mutate? All right. So what I want to do is switch to my web browser, and you're welcome to follow along with me. I am going to go to this particular new repository that I've made and put up on GitHub, just intro exercises. Now I'm going to put this link in the chat so you can click on that or save it for later. But what I'm going to do is show you the never-fails, always-works way of getting code from GitHub. There are more convenient ways to use RStudio and GitHub, but we don't have enough time to go into how you configure that. GitHub is a way to share your repositories or, if you'll think of it this way, specifically your R projects, so that somebody else can grab them. Now, you can make them public so that anybody can grab them, or you could make them private and designate who else can look at them. But this is a public repository. It consists of these files that you see here plus this folder called data, which we could click through, and it has in this case a README file. And I want to download all that and manipulate it. You'll notice, by the way, that it has an .Rproj file in it.
So when I launch that from the .Rproj file, I'll be in an R project and can leverage the benefits of an R project. So I'm going to click on this green button here that says Code, and I'm going to Download ZIP. Again, people who are more used to using RStudio and GitHub together, with, for example, a package called usethis, will know what to do with this stuff. We're just going to do it the never-fails way. I'm going to click Download ZIP, and on my computer that created a zipped file somewhere, which I know from experience is going to be in the Downloads folder. On a Windows machine, I can just click on that and go straight into the Downloads folder. And here I have a compressed file; this one compressed file has all of this stuff in it. And it's important in this never-fails method to expand that zipped file. I think on Macs you can probably just double-click on the zipped file and it will expand. I don't use Macs so I'm not 100 percent certain, but at least on Windows it's very important that you expand it, because while I could look into that zipped file, I don't want to just look into it. I want it to be expanded so I can read from it and write back into it. So I'm going to right-click on that and click on Extract All, and Windows will give me an option of where to put that, so in this case
I'm just going to put it on my desktop, and I guess it's going to give it whatever name it gives. Click Extract. Okay, and then it opened up a new folder for me. But let's just do something here: I still have the zip file, and then I have this folder. I'm going to minimize everything and just look at my desktop. It's right there. And when I open it up, there are all the files that I just downloaded. What I want to do is open this as a project, so I'm going to find this .Rproj file and double-click on it. And that's going to launch me directly into RStudio, and it's already set the working directory so that I can do things like what I'm going to do just now, which is an example of importing data with a relative file path. So if I open up a new project and start down here, I'm going to put in a new code chunk. I'm going to use the code chunk keystroke, which is mentioned right here: Control-Alt-I. And I'm going to say my data, and I'm going to do Alt-dash to do my gets-value-from command, but you can type that out. My assignment operator, gets value from, and I'm going to use... oh, I've got to stop. I already realize I've gone too fast. I do this all the time. I'm so used to using tidyverse that sometimes I forget that I have to load the tidyverse first. So I'm going to insert another code chunk, and I'm going to try and use some brief examples of literate coding: load library packages. I don't have any prose here, but if I did, I would put something like "the tidyverse is a great data-first set of packages". And then I'm going to load that package, tidyverse, and then I'm going to run that chunk. I'm not sure if I'm going to get any feedback this time. I did; that's fine. I just clicked that little x that came up. Tidyverse is only verbose on a kind of initial basis. I'm also going to, for me,
I'm going to get rid of all this stuff, though I would encourage you, when you're new, not to get rid of it right off the bat. It's helpful information, but I know what's there. So now that I've got the tidyverse loaded, I have access to my pipe function, which I can access by typing Control-Shift-M, or Command-Shift-M on a Mac. And all that pipe function is, is a conjunction. It allows you to string together a sentence, if you will, of R functions. So I'm going to read this almost from left to right. Oops, I didn't actually need to do that. Ignore that; we'll come back to the pipe. Now that I have tidyverse loaded, I can use the tidyverse version of reading in data. So if I type read and just stop, my context menu pops up with read.csv. And what I'm going to tell you is there's sort of a visual cue: if it's a dot, it's not part of tidyverse. Now, you can use this to read a CSV file. It's from the utils package; it says right there, utils, and it gives me some brief help on how to use this function. But I want to use the tidyverse version, so I'm going to put in an underscore, and it becomes the second option, and I'll just choose that. The technical reason why is that the tidyverse functions by default do not read strings in as factors. read.csv actually just recently changed too, but it changed because they're following the tidyverse lead at kind of our mothership; tidyverse made this change a long time ago. So if you happen to work with categorical data, that's a really important thing. And many people have come to decide that using factors as a data type often hurts more than it helps. I don't want to get into it too much.
I just want to say that I recommend to you, unless you're really, really fond of using factors, that this is a better way to import data: read_csv is better than read.csv. But neither one is wrong; one's just a little more modern. All right. So then I'm going to type my single quotes... I'm sorry, double quotes, and RStudio automatically puts my cursor in between the opening and closing double quotes. And then I can hit my Tab key, and it will give me a context menu of the file system, the relative file system based on the RStudio project. So I can just scroll down and choose the data directory, and it's going to give me another context menu. In this case there's only one file in there, Durham supermarkets .csv. If I go into this data directory down here, it's right there. All right, I'm going to put my cursor back over here and hit the Tab key again, and since there's only one file, it filled it out. That is how I would want to read in a CSV file and write the script in a reproducible fashion. So when I execute this code, you'll see up here in the Environment pane I'll get a new object called my data, and it tells me that it's an object of 84 observations and 26 variables. If I want to look at it, I could put in another code chunk, write down my data, and hit Control-Enter. And I've got this big data set here that once again has some stuff at the top about provenance that I put in there. I personally stuck that in there so I would know where the data is coming from. But it makes it a little harder to read in, and it's not exactly in the format I want. So I'm going to show you this thing about data import wizards. They exist in kind of two places. One is, generally speaking, under the Environment pane.
There's this thing called Import Dataset. From here you can import text data, either from base R or from readr. readr is the tidyverse, and I would always recommend the tidyverse over base R. Or you can import an Excel file, or a SAS file, or a Stata file, anything like that. That's one way to get there. The other way is just to navigate to the file. I'm going to click on this little icon so I get back to the project root, I'm going to navigate back into the data folder, I'm going to left-click on the file, and I'm going to click Import Dataset here. It's the same data wizard. I just find it more convenient to get to it from there. What this does is throw me into a data preview window and allow me to make some changes, because I want to skip line one and use the line after it as the column headers. By the way, this data comes from the Open Durham data portal, and it is some information about supermarkets and convenience stores. So I'm skipping line one. I'm going to go down here to Skip, I'm going to put in the number one, and I'm going to hit Tab. It's going to redraw this data frame with the actual variable labels as the very first thing. And then it gives me this really nice view of all the code I would need to paste into my R Markdown document or my R script in order to run it the same way every time. Now, I will tell you that you can click Import right now and that will work, but if I click Import without doing anything else, I've lost this code. Only temporarily, so a convenience is to copy and paste it. I can right-click and copy, or I can click on the clipboard icon. I'm going to click Import. I'm sorry.
I'm going to click Cancel, because I already copied that into my buffer, and I'm just going to paste it right here. Then I'm going to tell you what I'm not going to use out of that. I am not going to use that line, because it's redundant: it loads one of the eight libraries that show up when I load the tidyverse. But redundancy doesn't really hurt anything. If you left it there, it wouldn't hurt anything. And I'm also going to comment out line 21, because all that's going to do is throw the data up into a data viewer that I don't personally ever use, but it's handy to know that it exists. So this command here and this command here are almost identical. There's a different object name that I'm loading the data into, and I'm really doing it because I just wanted to remember this one little bit of syntax that I probably wouldn't remember otherwise, which is to skip one. So now, if I execute these two, I have two objects in my Environment pane, and they're both the same, because it read in the data file twice and just gave it different object names. Now I'm going to scroll down here, and I have my data and I also have Durham supermarkets. If I execute that code chunk, you can see that I now have two inline viewable data frames that are visually identical, and in fact they are identical. So you don't see any difference, but I can scroll to the right and scroll to the left, and I can get through the first, I think, thousand rows of any data frame that way. If you're working with really big data, it's going to chop that off, because it loads it up into memory and you don't want to bog down your computer. John, there are some questions in the chat about whether you could maybe go through that process of getting to the data viewer again. Uh, sure, and please re-ask if I'm not answering the right question. Oh, you're talking about the import wizard? Yes, yes, in the import wizard. All right. Let me do that again.
Once again, I'm going to click on this little icon right here, which brings me back to the project root. Now, if you didn't open this up as a project, like if you're doing this on your own and you didn't make a project, this icon is not going to work, but let's ignore that. For most of you, assuming you clicked on the R project file, you're going to be exactly where I am. What I want is to get to the data wizard, and I know that my data is in this data folder. So I'm going to left-click on the data folder, and I'm going to left-click on the file that I want, and that gives me a context menu where I can find Import Dataset. Now, I always like to point out that it's the same data wizard that you will find up here in the Environment tab, but the Environment tab allows you to manually make a determination about what the file type is, whereas this is going to make its best guess based on the file extension. So I'm going to go to the import data wizard, and there I am, in the data wizard. I can change the name of the object that the data gets loaded into, and you'll notice that this code gets changed dynamically. So I can call it foo. And I can decide how many lines get skipped, so one. I could change the delimiter. Maybe it's tab-separated data, which it's not. Of course now it's not properly read in, so I'm going to change it back to comma. But as I change all those things, this code preview changes dynamically with me. Then I'll just copy that, and my practice is to paste it back into the code chunk, because it's a really easy way to generate the sort of, I'm going to say, syntactically verbose code that I never seem to memorize. Right.
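The skip step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the file written here is a small hypothetical stand-in (the column names, values, and provenance note are invented), not the workshop's Durham file, and the object name `stores` is likewise made up. The only call it relies on is `read_csv()` from readr with its `skip` argument.

```r
library(readr)

# Hypothetical stand-in for a CSV that has one provenance line above the header.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Source: example provenance note (hypothetical)",
    "name,type,zip",
    "Store A,Supermarket,27701",
    "Store B,Convenience,27705"
  ),
  csv_path
)

# skip = 1 drops the first line, so the second line is used as the column names.
stores <- read_csv(csv_path, skip = 1, show_col_types = FALSE)
stores
```

In a project, the temporary file would be replaced by a relative path into the data folder, which is what keeps the script reproducible on another machine.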
So then once I paste that in, I just personally start getting rid of the stuff I know I'm not going to use. And I can run this code chunk one more time, and now I'm going to have three objects that are all identical, because it's reading all of them from the same file. Good. All right. So I'm going to click on this icon one more time, and I'm going to go to this file called 01a dplyr. And, Marcos, I don't know if you meant for your microphone to be hot, but my eye just caught a glimpse of your screen with a green thing, so if you have a question, go ahead and shout it out. No, it's no question. I just wanted to thank you out loud. I think this process that you just went back through again really was very helpful, so I just wanted to thank you out loud. Lovely. Thank you very much for that. I appreciate that. I'm glad to help. You're very kind, Marcos. It is my goal to make this understandable, because when you first approach R, it can feel like a tangled mess. So I'm glad it was useful for you. All right, so I'm going to go to this file, unless somebody else says something: 01a dplyr, dot Rmd. That's the R Markdown file. Can I ask a quick question? Do you have a question? Sure. Yeah, I do. You don't have to walk me through how, but maybe just point me in the right direction. If you're working with survey data from Qualtrics or REDCap or something, often if you export to a CSV, it'll give you many, many rows of metadata in the first couple of rows. Does this have a way to automatically assign the labels if you skip, say, four rows? Or is there a helpful way to link those labels? Yeah, as a matter of fact, I was actually demonstrating that with this skip argument. But you may be asking specifically about Qualtrics-formatted data rather than plain CSV data. So, two answers. One is, yeah, there is a skip argument. Go ahead.
Go ahead. Oh, I was just wondering, I guess, if I had four rows, will it always link them? It won't just ignore the extra rows, it will always link them in somehow? Whereas I've struggled with that in Stata. I think the answer to that is yes, sort of. I would love to see an example of what you're talking about, because I don't have the Qualtrics data format in my head, although I'm familiar with Qualtrics. So, a couple of answers. One is that there is a skip argument, and the way I hear your question, that is how I would first approach it, but I may not be hearing your question properly. Another thing to point out is that there is a Qualtrics library for R that makes it easier to import data directly from Qualtrics, and I've had, I don't know, 60 to 70 percent success with that. It can be really handy when that particular library approaches the data the same way you formatted and exported your data. Qualtrics can be a little goofy sometimes. And then another comment that I would make is that data labels can be a real challenge, especially when you have lots of columns. There is also a package called janitor, which I personally tend not to use, probably because I'm a little too obsessive-compulsive. But what janitor does is clean up all of your file names, I'm sorry, variable names, into names that are much easier to deal with. Because with variable names in the R context, or really in any kind of serious data context, you don't want your variable names to have spaces.
You don't want them to begin with numbers. There are a lot of rules about how you create a good variable name, and the janitor package will help you do that automatically. So I would say to you, Zoe, that probably some combination of arguments for read_csv, or the Qualtrics library, or the janitor library is going to help you a great deal, and if you want to set up a consultation with me on that, I'd be happy to look into it more specifically. That was really helpful. Thank you. Great. Okay. All right, so once again, I'm going to open up this file. But I want to point out that you could also open up the one that has the same file name but with underscore answers, because that one has the answers directly in it, and a report of the answers, so you could just look at it in a web browser. But this is the file I'm going to open: 01a dplyr, dot Rmd, Rmd standing for R Markdown. It opens up as an R Markdown document, which will output an HTML notebook like the ones we've seen so far. The first thing I have there is the word dplyr, bolded, and some information about where these files are coming from. Now, in my first code chunk, I have this line that says install dot packages, gapminder. You can see the syntax there; that is the proper syntax. I want to caution you: I myself don't like to put install.packages into my scripts, especially scripts I'm going to share with somebody. Usually putting that in there is going to be non-destructive, but you never know when somebody has some kind of dependency on a particular version of a package. I don't like to overwrite people's libraries if I don't have to. But gapminder is just a training data set, and if it overwrote your gapminder, I'm 99.99999 percent positive it wouldn't cause any problems. So I left it in there. But again, with install.packages:
You only do that once, and you can easily do that from the Packages tab. Loading the packages, on the other hand, you do every time you run the script. So I'm going to execute that first code chunk. It's going to install gapminder, which is some population data from specific years, from about 1952 to almost the present. I think it's in five-year gaps, across all five continents, for many, many countries. And we're going to look at it in just a second. Right. So the first thing I usually do when I get a new data set is use the glimpse command. For those of you who are used to old-school base R, glimpse is very similar to the command str, or "stir," or structure, depending on how you want to pronounce it, but s-t-r are the letters. I'm going to execute this, and I think glimpse displays the data a little bit better. glimpse is especially helpful when you have a really, really wide data frame, let's say 30 or more columns, because what it does is list each column name down this first column. So I know I have six columns, and the columns are country, continent, year, lifeExp, pop, and gdpPercap. Now I want to do something here that I hope will make it a little bit easier to see. I'm sorry that I haven't done it yet, but there were a lot of things I wanted to show you. I'm going to change the appearance to 150, and I'm going to click Apply. I don't necessarily recommend you do this, but you could. And the other thing I'm going to do is display only the editor pane. You can do that with these commands up here, but I know the keystroke, which is Ctrl+Shift+1. This is actually the way I tend to work in an R Markdown document, because once I get everything loaded up, all those other quadrants are not so useful to me. So here I have my view of gapminder. And what we're going to now introduce to you... oh, I'm sorry, I wasn't done.
So it's 1,700 rows, 1,704 rows to be exact, and six columns, and the column names are listed right there. What I get here is the data type of each column. So country is a categorical variable, and R refers to the categorical variable type as factors. Then it gives me a preview of the data, and, you know, it's a small preview. It turns out that the first several rows are all the same. Continent is similar. It's a factor, and you get so far in and then all of a sudden you see a different category: there's Europe as opposed to Asia. Then there's year. It's an integer, so a whole number. And there's life expectancy, and it's a double, or a floating-point, decimal number. You don't necessarily have to worry about these distinctions when you're new, but it's at least helpful to know that R has a rather precise notion of data types. The one that we don't see here is character. These first two columns could easily be character, but they're being treated as categories, and those kinds of things can trip you up when you're new. But it's not so important to really get into the details right now. What we want to do is learn how to subset the data: how do we take a six-column data frame and just pick maybe two columns, or how do we take a 1,700-row data frame and just pick certain rows, maybe all the rows where the year equals 1952? So that's what select and filter do. select subsets by column, or variable name, and filter subsets by row, or what you might call an observation. And then arrange allows us to sort the rows by a variable. You'll see what I mean in just a second. So I showed you how to glimpse gapminder. I can still just look at gapminder as a rectangular grid, the normal way. The only downside is that it's a 1,700-row data set and it's only going to give me the first thousand rows, so there are another 700 rows here. But that's not that big of a deal, because once I learn all these different functions, the dplyr functions,
I can become very comfortable with knowing that I'm only looking at a part of the data. I only need to look at a part of the data, and I don't need to use up all my RAM with the whole data set. I'm only going to need to iterate over the whole data set once I get my process in place. Right. So there's gapminder, and the first question is, how do I subset this by column? I want to see just the year and the population. And I can see right now that I wrote that wrong. I wrote population as if that's the actual name, but the actual name is pop. You can see it right there: there's pop, and there's year, and those are the only two things I want to see. So I'm going to type gapminder, then I'm going to put in a pipe, with Ctrl+Shift+M, or you could type that out by hand if you want to: percent, greater than, percent. And I hit my Enter key. I'm going to type select, and I can use my tab completion. I'm going to type in the variable names. If I type really slowly, you'll see that after I get past the first three letters, it's going to allow me to do tab completion again. Comma, and pop. It turns out that there are two possibilities with pop. I just want the one that has the little pink flag, because the other one has a little icon that clearly shows me that that's a data frame, an additional data frame of population. It's not what I want to look at. And with this one we've had tab completion. When I execute this code chunk, I have subset, on a temporary basis, the gapminder data frame to now be just 1,700 by two rows, or sorry, 1,700 rows by two columns. It's handy to know that you can also select by column position. So gapminder, and then select, let's say, columns two through four and column six. All right, so if I run this code chunk with both of these, you'll see two different data frames. One has the two columns, and this one has four columns. Generally speaking, I think it's safer to select by name than by position, but you can do both, and you can mix and
match, right? I could type continent here, and... that didn't work the way I expected. I'm not sure what I did wrong there, and I'm really confused about that. That's what I love about live coding. I'm not sure why that happened. Year, life expectancy, GDP. I guess it's something you need to look out for. Maybe I should have rebooted my computer at some point. It's usually not that fragile. What am I missing here? Life expectancy. Huh, that's a stumper. That bothers me a great deal. I don't know what's going on there. That is definitely wrong. That makes me nervous, so nervous, in fact, that I am inclined to shut this booger down and restart it and see if I can straighten it out. Okay, so I'm going to go back into my exercises. Now, this is the problem with having just one screen: I need to go back to this 01 file here, and I'm going to make that one screen again, and then I'm going to scroll down to about where I left off, which was right there. And this time, as just another little tip, I'm going to click this little funky button, which runs all the code chunks above, which I need to do because I restarted. I'll get a little progress bar down here while it does that, and then when it stops, I know that I can run this code chunk. And I am super confused as to what's going on there. There's something I don't understand about that command, or they recently updated the select function. Anyway, I guess it's a good example of how you always need to be careful and verify what you're doing with your data. I'm going to move on, because the point is that that's how you subset by column, and for myself, I usually just work with variable names, and maybe that's why I don't know what went wrong there. All right, so that's subsetting by column.
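As a recap of subsetting by column, here is a minimal sketch assuming the dplyr and gapminder packages are installed. The object names `by_name`, `by_position`, and `mixed` are made up for illustration. It does not try to reproduce or explain whatever went wrong in the live session; it only shows the three forms of `select()` that were described.

```r
library(dplyr)
library(gapminder)

# Select by column name: just year and pop.
by_name <- gapminder %>%
  select(year, pop)

# Select by column position: columns two through four, plus column six.
by_position <- gapminder %>%
  select(2:4, 6)

# Names and positions can be combined in one call.
mixed <- gapminder %>%
  select(continent, 3:4, 6)

dim(by_name)
names(by_position)
```

Selecting by name tends to survive changes in column order, which is one reason to prefer it over positions in scripts that will be rerun later.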
Let's subset by row. We do the same thing: gapminder, and then filter where year, which is one of the variables, and I'm going to use the double equal sign for equivalency, equals 1987. Because 1987 is a number, I don't have to do anything special. If it were character data, like, let's say... well, I'll come back to that. 1987. Now when I execute that code chunk, you can see I have now subset down to just 142 rows, and if I scroll through this, every single one of them has the year 1987. Now, you can mix and match those things, in case you're wondering. I can put in another pipe and then, let's see, filter continent equals, and now continent is a text variable. I mean, it's actually a categorical variable, but it's text for sure, so it goes in quotes. So I'm going to put in O-C... wait a minute, I'd better not do that, because one of the things you're noticing is... here we go. I always say I'm the world's worst speller. It's amazing I ever got a job in a library. There we go. So there's a combination of doing two filters at once. 1987 is one filter, and Oceania, if I'm saying that right, is the second filter, and that works. So that's filter and select. Now I want to introduce to you how to sort, which is with the arrange function. So I'm going to sort by population. If I highlight gapminder, the data frame, and hit Ctrl+Enter, I can execute just that one line, and that gives me a view of the data frame so that I can do what's being asked here, which is to sort by population. I wanted to remind myself what the population variable name was. So gapminder, arrange, pop. Now if I hit Ctrl+Enter, it's going to execute that whole expression, from gapminder to pop. And now what I've got is all of this data listed in ascending order, by default. So if I go all the way to the end here, or to the end of what's in memory, it's a much larger number than what's in row one. Next: sort continent in reverse alphabetical order.
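The row filters and the default sort described above can be sketched like this, assuming dplyr and gapminder are installed. The object names are hypothetical, chosen only for the example.

```r
library(dplyr)
library(gapminder)

# Keep only the rows where year equals 1987 (note the double equal sign).
year_1987 <- gapminder %>%
  filter(year == 1987)

# Chain a second filter; text values such as "Oceania" go in quotes.
oceania_1987 <- gapminder %>%
  filter(year == 1987) %>%
  filter(continent == "Oceania")

# arrange() sorts in ascending order by default.
by_pop <- gapminder %>%
  arrange(pop)

nrow(year_1987)
oceania_1987
```

The two chained filters could also be written as a single `filter(year == 1987, continent == "Oceania")`; chaining them is simply closer to how the steps were typed in the session.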
So I can sort not just numerically, but also alphabetically. So arrange, continent. And it said in reverse alphabetical order. This technique of reversing works both for numbers and for letters. I'm just going to embed an additional function inside of the arrange function. So I'm going to highlight my variable, this continent, and I'm going to press my Shift key and my open-paren key, and that's going to wrap the whole thing in another set of parentheses, so that I can add my other function embedded inside the first one. And I'm going to type in the function desc, for descending. So that's reverse, right? Descending alphabetical order would be reverse alphabetical order. And if I run that, then you'll see that I have all the Oceania rows listed at the top, and if I scroll through them for quite a while, then I get Europe, and it just keeps on going, and eventually I get to some other continent. Another thing to point out is that I can sub-sort. So not just reverse alphabetical order by continent, but I can also reverse the order by year. Let me go back up here to row one. It may actually already be in that order. Let's see what happened. No, I think it's in ascending order, right? Oceania in 1952, all the way up. So let's put year here and do the same thing, descending, and run that command. And now you see how I can sub-sort, just by adding more arguments. I think I've said this, but I feel a need to point this out right now: these changes that I'm doing right now are only temporary. I'm not changing the original data set. So if I write gapminder down here one more time and execute all three of these commands, I will get three data frames. The first one is arranged by population.
The second one is in descending order of continent, sub-arranged by year. And the third one is just the full data frame. If I wanted to fix this version of the data frame and deal with it later in just that sorted order, what I would do is use that assignment technique that you saw me do several times before. For example, when we opened up the session, I did this with John, and I forget all the other names I put in: Layla and Boogie. All right, so I'm going to use that same technique, and I'm just going to call it sorted gapminder, and put in my assignment operator. I'm going to comment this out, because I don't actually want that to run. Now, I can do that, but I'm just going to reiterate that when I do that, I won't actually see three data frames, because when you assign it, all you're doing is assigning it. It's not going to display by nature. Let's look at that. So now I have two data frames. The first one is displayed, the third one is displayed, and the second one is kind of invisible. If I zoom back out here and look at my Environment tab, it's right there. And notice, by the way, that gapminder is not in my Environment tab, because gapminder is an onboard data set. It's a weird, funky little thing. But if I want to see the sorted one, it's so simple. All I have to do is type its name again: sorted gapminder. And now when I run these three, I'll have three data frames. Anytime I refer to sorted gapminder, it will always be the same particular view of gapminder.
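A minimal sketch of the descending sort and the assignment step, assuming dplyr and gapminder are installed. The name `sorted_gapminder` follows the spoken name "sorted gapminder"; the exact spelling used in the session is an assumption.

```r
library(dplyr)
library(gapminder)

# desc() reverses the sort; the second argument sub-sorts within each continent.
sorted_gapminder <- gapminder %>%
  arrange(desc(continent), desc(year))

# Assignment does not display anything; type the name to see the result.
sorted_gapminder

# The original data set is unchanged by the pipeline above.
head(gapminder, 3)
```

Because the pipeline's result is stored under a new name, every later reference to `sorted_gapminder` sees the same sorted view, while `gapminder` itself stays in its original order.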
So it might actually make more sense if I did something like this: select country, year, and pop, sort of making a compound sentence. And the difference that you can see right here is that it's not immediately obvious that these two data frames are sorted differently, but they're the same size, whereas this data frame has only three columns: 1,704 by three rather than 1,704 by six. Okay, so that's a good review of... I see that we're at an hour and 40 minutes. Let me see what my time is. 3:30. We've got a half an hour to go, is that right? No, we're done. Oh my gosh. Sorry. I was 20 minutes behind schedule. All right. I don't want to presume that you guys can hang with me all day long, and I'm sorry that I lost track of time. I will be happy to continue this. There are only a couple more exercises here. I'm not going to get to visualization, but I will invite you to sign up for my part two, or you have access to all the videos, and you can always schedule me for a consultation. This part is super helpful, because one of the things about visualization is that you often need to get your data subsetted, or wrangled, in just the right way, and dplyr is very helpful for that. So I'm going to keep on that track. Sorry I didn't get to everything, and I know some of you have to go. Okay, so mutate is super helpful, because that's how you create new variables. The goal here is to create a new variable called double life. We did an example of this kind of thing earlier. Mutate. And by the way, the same goes for you, Drew.
I understand if you can't hang around. I'll do my best to manage the chat. Mutate a new variable called double life. For double life, I'm going to use the equals sign as the assignment, and it equals life expectancy times two. Now when I execute that, I have seven variables, which I can see right there, and there's double life, which is double this, life expectancy times two. count can be really handy. Count how many observations exist for each country. So I could just do something like this: count, country. And this will be a little underwhelming, because it gives me the same answer for every country, since this is a very clean data set. Every country is listed in there 12 times, for the 12 different years for which they collected population data. But it's an easy way to figure out what countries are in the data, that is, what values exist within the country column. So you could use count. There's another way to do that, however, which is this nice command called distinct. I can do distinct country, or distinct continent. I'll do continent this time. And that's a great way to ask, well, which continents are represented, without scrolling through the whole data set. Now, I see my folks are dropping like flies, so I appreciate anybody who's still here, and of course there will be the video. We're going to introduce the concept of sum. What if I wanted to sum all of the population column? Well, in order to do that, I kind of need to know two things. One is that there is a function called sum, right?
So I could do this: sum of 5, 7, and 10, and if I run that, I get an answer, 22. But I want to sum a whole column of gapminder. Let's say I want to sum all of population, which of course doesn't make any sense, because I have lots of different years here, but it shows you a good function. I can just type summarize. And I'll point out another thing: the tidyverse, and R, was super popular in New Zealand. In fact, the rock star of the tidyverse, Hadley Wickham, is from New Zealand. So they have both the British and the American spelling, and you can use either. I tend to use summarise with an s, because it's the first one listed and it's convenient, but if you wanted to spell it with a z, that'd be fine. And then I type summarize, pop, and you'll see what happens is... oh, I'm sorry, I did that wrong. What I wanted to do is type total pop gets its value from the function sum of pop. And then I get this really big number. It turns out to be something like 50 billion, which is not a particularly useful thing to compute right now, because it's out of context for the data that I opened up, but you may have a different column where you do need to get a column total. Now, that's what summarize does, but usually we don't use summarize by itself.
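The mutate, count, distinct, and summarise steps can be sketched together as follows, assuming dplyr and gapminder are installed. The names `double_life` and `total_pop` follow the spoken names; their exact spelling in the session is an assumption. One more assumption to flag: as far as I know, `pop` is stored as an integer in the gapminder package, and in recent versions of R a plain `sum(pop)` over the whole column can overflow to `NA`, so this sketch converts to double first. The code typed in the session may have differed.

```r
library(dplyr)
library(gapminder)

# mutate() adds a new column computed from an existing one.
with_double <- gapminder %>%
  mutate(double_life = lifeExp * 2)

# count() tallies rows per value; distinct() lists the unique values.
per_country <- gapminder %>% count(country)
continents <- gapminder %>% distinct(continent)

# sum() on its own adds up whatever numbers it is given.
sum(5, 7, 10)

# summarise() collapses a column to a single value.
# as.numeric() guards against integer overflow (see the note above).
total <- gapminder %>%
  summarise(total_pop = sum(as.numeric(pop)))
total
```

Both spellings, `summarise()` and `summarize()`, refer to the same dplyr function.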
We usually use it with a function called group_by. So I can do something more useful, like, say, group by year. If I just run this, it really won't look any different. I just get an extra message up here that says I've got 12 groups, one for each year: one for 1952, 57, 62, et cetera. But then if I combine that with summarize and do the same thing, total pop gets its value from sum of pop, now I have a data frame of population totals for each year across the whole world, and of course I can see that going up. Also note: I know this is a really hard number to look at, but it's a numeric data type, and the fact that it's hard for me to look at kind of means that it's easy for the machine to deal with. But if I needed it to be easier for me to look at, I could do something like this, where I add another function: scales, colon colon, comma. So from the scales library, use the comma function. And when I run that, I now have, for me, a much easier number to read. The only downside is that it changed my data type. Now it's considered a character, and it's considered a character because it has commas in it. And the downside to that is that I can no longer do math on it, because it's not numeric. So you kind of have to be careful of those kinds of things. That's worth knowing about. And I'm almost done. Just another thing to point out here, and I'm going to grab this code chunk, copy it into my buffer, and paste it right here, is that summarize can do more than one thing. So I can say mean pop gets its value from the mean function, and type pop. And now I have, grouped by year, the total population and the mean population. And again, the difference between these two is that I forced this one to display so it's easy for me to see, and this one is still a numeric, double, floating-point variable. It's easy to do math on, and I could easily visualize that. And so I'm actually going to do that, just real quickly. It won't take me much longer, and then I promise I will stop.
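Here is a hedged sketch of the grouped summary and the comma formatting, assuming the dplyr, gapminder, and scales packages are installed. The object names `by_year` and `readable` are made up. As in any whole-world total of this data, the sketch converts `pop` to double before summing, on the assumption that `pop` is stored as an integer and the yearly totals exceed the integer range.

```r
library(dplyr)
library(gapminder)

# One row per year, with two summary columns computed in a single summarise().
by_year <- gapminder %>%
  group_by(year) %>%
  summarise(
    total_pop = sum(as.numeric(pop)),  # double, to avoid integer overflow
    mean_pop = mean(pop)
  )

# scales::comma() makes the totals easier to read, but returns character data.
readable <- by_year %>%
  mutate(total_pop = scales::comma(total_pop))

class(by_year$total_pop)
class(readable$total_pop)
```

Keeping the numeric version in `by_year` and formatting only a display copy means the numbers stay available for math and plotting.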
I will stop I could send this to example to the visualization visualization package called ggplot And ggplot once as its first argument to be a data frame But we did that up to here. We wrote a data frame So then we're saying and then go to ggplot And then the another argument that I need is the aesthetics argument where I identify the x and y values Of my whatever I'm trying to visualize in this case. I'm going to try to visualize a scatter plot. So my x value Uh could be year And my y value could be mean pop And then I need one other thing and this is where it gets weird. I'm sorry Um, I'm no longer going to use the conventional pipe because ggplot is the first In the evolution of all the packages like tidyverse and it uses a different conjunction uses a plus sign And I'm going to pipe that to a particular layer visualization layer technique to make a scatter plot There are tons of gm functions If I just go here and hit tab I can make an area plot a bar plot a box plot this goes on The gm point is that what I want And if I run that Then I have this Uh nice scatter plot. I could actually make it a line plot Just by changing the gm function And you can see that right now my scale is in scientific notation, which is also hard to read So, uh scale y continuous labels equals scales comma See if I got that right. I did. Yay um so, uh It's so kind of you guys to still be here with me. I'm sorry. I went so long. 
I don't want to take up any more of your time. I will make the video available. And one of the ways I want to encourage you to use R is to take what you've learned here today and start small. You've always got to start small. Find a project that you've already done. If you're an Excel user, say, don't make it a super elaborate project; just try to replicate in R what you can do in Excel. What you'll find is that it'll force you to realize the stuff you don't know, and it'll force you to figure out how to overcome those things, and as you're overcoming those things, you're learning new stuff. It looks like my buddy Drew took off, and it looks like I have some chat. So I'm going to open up the chat and say I bid you goodbye, and if anybody wants to unmute and ask a further question, I've got nowhere to be. Thanks so much. Good luck, and reach out to me if you have any questions. Hey, John. Hey. Could I circle back to just the differences between an R script and an R notebook and an R project? I definitely understand the value of R Markdown now, because you went through that very well, but I just don't quite know what the differences between the other three are. Okay, so R script and R project. An R script is a different way to do what you do in an R Markdown document, and what I would say to you is, if you're new and coming to R, don't bother learning R scripts. It's like the old-school way of doing it. But that's how I initially learned. Okay, well, that's fine, and you can keep doing that. The difference is that in an R script, you can't integrate the prose. So let's take something like this section right here. Oh, let's make that an... oops, I don't know if I meant to do that.
I'm going to copy that and make an R script: File, R Script. I'm going to paste it here and get rid of my prose. Well, I don't have to get rid of my prose, but my prose has to be a comment. So I can do something like this: library(tidyverse). Then I don't need this, and I also need library(gapminder). I don't want to get too weird or authoritarian about this. This is fine if this is what you like; don't change on my account. The problem with this approach, from my point of view, is only this: what we seem to know about coding is that when you have to write comments like this, it's so inconvenient that people tend to give short shrift to their commenting. So over time what you get are scripts that are kind of hard to read, because they're idiosyncratic to whatever you're doing and nobody really goes back and comments them that much. It's not that you can't also write hard-to-read code in R Markdown. It's just that what R Markdown brings, and R Notebook brings, is the ability to intersperse natural-language prose with code and then generate all kinds of derivative outputs, like a dashboard or a set of slides. Whereas with an R script it's a fair bit of work to generate different kinds of outputs. If you never want any other kind of output, and all you want to do is save some pictures, really, this is fine. What I would suggest is that if you start using R Markdown, the more you use it the more you'll like it. It may feel foreign at first, but eventually it will feel very convenient. But again, don't change on my account. There's nothing wrong with .R. It's been around for ages, it will continue to be around for ages, and lots of people use it. But what you see here, this is an R script.
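
To make the "prose has to be a comment" point concrete, here is a small sketch of what such a script could look like. The analysis itself is a made-up stand-in, not the section that was copied on screen; only the two library() calls come from the session.

```r
# In a .R script every line is executable,
# so any prose has to be written as a comment like this one.
library(tidyverse)
library(gapminder)   # the example data set loaded in the session

# Keep only the 2007 rows and summarise life expectancy by continent.
# (Hypothetical analysis, used here only to show comments sitting next to code.)
life_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(median_life_exp = median(lifeExp))

life_2007
```

In an R Markdown document the same code would sit inside a code chunk, and the explanation would be ordinary text outside the chunk rather than comment lines.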
So if I then save this, I'm going to call it example 2, dot R with a capital R. Save, and let me zoom back out. So here's my R script with a .R, and with this .Rmd I can create all these kinds of report derivatives. The executable stuff in a .Rmd exists within a code chunk, whereas the executable stuff in a .R is just there, and you have to go out of your way to protect your comments from execution. So that's the only difference. I hope that makes sense.

The issue with projects has more to do with reproducibility and setwd() than it has to do with the scripts themselves. One thing you'll see is that a lot of people will do something like this at the top of their screen, in an R script in particular. If I type this command, getwd(), it gives me a response down here in my console with the working directory of this project. So if I want to use relative file paths, which is just one of the best practices for making reproducible code or reproducible projects, then I would, in this context,
I would have to say setwd() at the very top of my script, and I would have to set the working directory to an idiosyncratic location that is specific to my particular computer. What that means is that every time I run this I can then do this, among other things: read csv, from data. I'm going to call this my data. I can then do this, so data is now a subdirectory of this working directory.

The problem, and the reason to get out of that habit, is that if I share my code base with anybody else, they have to go and find wherever the setwd() command is. Ideally it's up at line two, but it might be at line 10 or at line 15, and I might not even know it's there. I'm always going to have to troubleshoot your code if you give it to me in that manner, and I'm going to have to make that change specific to my computer. Now, if you don't mind me saying, in my mind that's sort of pernicious, or if it's not pernicious, it's a hassle. It's kind of a pain. But you might go, "So what, John? I'm not going to share any code with you." Fair enough. But who you are going to share code with is yourself, right?
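
The contrast being drawn here can be sketched in a few lines. The absolute path and the file name `my_data.csv` are hypothetical, and base R's read.csv() stands in for whichever CSV reader was used on screen.

```r
# Fragile habit: hard-coding a location that only exists on one computer.
# (Hypothetical path, left commented out so the sketch runs anywhere.)
# setwd("C:/Users/john/Desktop/intro_to_r")

# Inside an RStudio project the working directory is already the project folder,
# so getwd() reports wherever the project happens to live on this machine.
getwd()

# A relative path is resolved from that folder, whatever it is.
data_path <- file.path("data", "my_data.csv")   # hypothetical file in a data/ subdirectory

# Only read the file if it is actually there.
if (file.exists(data_path)) {
  my_data <- read.csv(data_path)
}
```

Because nothing in the script names a specific computer, the same lines work unchanged after the project folder is moved or shared.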
And what I mean by that is you're going to replace your computer sometime. If you've got all this code base from your last computer, then unless you're very careful about moving all of your projects into the exact same file locations, you're going to have to go back and rewrite not only setwd() but anything else where you're using explicit file paths to reference things like saving outputs and whatnot. So if you get in the habit of not using setwd() but instead using an R project, then by default the R project is just going to bring this along no matter where it exists.

When I mentioned that on GitHub you could grab this repository: you can click Download ZIP and expand that. I expanded it onto my desktop. What I'm going to do is shut this down. I don't know that I want to save those things, so I'm going to say Don't Save, and I open the project this way. Although it's probably not obvious, I'm actually working on two different computers during this session. One is my home, or actually my work, laptop and the other is a virtual Windows machine, so they don't have the same file structure. Now that I've opened this as an R project, if I just type getwd() right here and run that command, that's telling me that this project exists on my desktop. But if I switch here to my home computer, run this project, and type getwd() (I don't know why I have my Caps Lock on), you can see that this project actually exists in an entirely different place: the Documents folder versus the Desktop folder. Even though they're in different places, it's all going to run, because I've been using relative paths. So I hope that begins to explain why you want to use a project.

Another reason is that I can have multiple projects open at one time. So I can go through my short list of
recently opened projects. I can go down here and open up this attendance sheet workflow. If I click here, it's going to replace the one I have; if I click here, it's going to open it up, and now I have two projects. And they're going to have separate environments. So I have no concern about running the risk of, what if I use the same variable name in both of those projects? Do I need to be careful about what order I run scripts in, because I might have accidentally overwritten a variable? It doesn't matter, because they're in two different projects. They're two distinct RAM spaces, if you will, and they're not going to bleed into each other. I think that's the short of it. I don't know if I'm making the case compellingly, but I hope I'm explaining why you would do it. If you have follow-up questions, or if I still haven't gotten to the nub of it, let me know and I'll be happy to give it another go.

No, that's very clear. Thanks so much.

Yeah, thank you, John, that makes a lot of sense. I saw at the beginning of your first video how you made a new folder first and then made the new R project. So do you need to do that step in order to have these two different R project spaces that you can be running at the same time and all that?

Not really. That's one way to do it. I realize that I'm probably throwing a lot at you: you can do it this way or this way or this way, and that can be kind of, "I don't get it, which one am I supposed to do?" But the easiest way to start a new project is to go up here in the upper right-hand corner and choose New Project. I'm actually going to do this on the virtual computer because... oh, I forgot. I don't know if you guys saw everything I was doing. Do you see a blue screen right now or a white screen?

We have a white screen right now.

Okay, sorry. Let me go back to my white screen.
I just showed a whole bunch of stuff that probably didn't get transmitted. But all right. Up in the upper right-hand corner you see Intro to R exercises, dash master. If I click on that, what I would normally do is click New Project right here, and then a wizard will come up. You saw that there was a little delay. The way I showed you in the video was applying the project status to an existing folder where I had already moved stuff in. But for me it's actually easier, because I'm always in R, to just start a project by clicking New Directory. So I can start that here. Go ahead, sorry.

Does it make a new folder?

Yeah. So when I click on New Project I can now call this... if you'll forgive me, things that I know I want to delete I always start off by calling "del me." That way, when I run across it, I know that I did this for some workshop and it's probably no longer relevant. So I'm going to call it del me Anna and John analysis. By default it's putting it wherever RStudio wants to stick all of my projects. That's what that tilde means: your home directory, wherever it's sticking all your projects by default. But I could literally put it anywhere. So I'm going to put it on my desktop again, like I did before, and that's going to change this. That's just however you want to navigate your file system. Then when I click Create Project, it's going to close the project that I had open, unless I had checked a different checkbox, and now I have a new project. So now when I minimize everything, here's the project I was working on earlier today.
And here's the project that I just created, and it doesn't have anything in it except this one file. So you can do it that way. I find it more convenient to start the new project from within RStudio, but that's because I'm always in RStudio, and I don't really have a preference. The other thing that's probably worth noting, I didn't show this, but... let's see, where's my web browser? You did just see my desktop, right, a second ago? Okay, good. I'm never sure what I'm doing on Zoom. It's such a weird environment.

So if I go back here to GitHub: I didn't show this before. What I showed is the always-works method, where you download the zip file. But if I grab this thing right here, by clicking on this clipboard, which gets this code, I can then create a project in RStudio with just that. So I'm going to go back here. Now I'm in RStudio, and I'm clicking New Project. Do I want to save some stuff? I'm going to say Don't Save. Then I'm going to go down here to Version Control, choose Git, and paste in that thing that I just put in my buffer. And this time I'm going to click Open in new session so that I can keep them both open, and click Create Project. Now what it's going to do is create a local project from a GitHub project.

I realize that this can be kind of weird, but to me, I guess, the analog is when I used to use Microsoft Office all the time. I was writing reports for different discrete things, let's say classes.
I have a sociology class and a math class. So in my Documents folder... well, I'm not going to say math class, because I wouldn't have written a report for a math class. So it's a history class. In my Documents folder I'm going to have one folder called Sociology 101 and a different folder called, you know, History of the Spice Road, and I'm not going to bleed different parts of my reports across those two folders. Those two folders are discrete. And that's all an RStudio project really does: it allows you to keep discrete things on your file system and then leverage stuff like relative file paths. So you can put them anywhere you want, but in practice I have, like, a gazillion... I probably won't see it on this computer, because this is a virtual computer, but if I open up the file manager, yet another view of the file manager, I have a ton of projects in my Documents folder, and they're all RStudio projects. I mean, there are actually some different ones, some Zoom projects and some Panopto projects and stuff like that, but most of them, in my case these days, are just RStudio projects.

That's really helpful, just the whole setup. I rewound the first part of your video, because we do a lot of data manipulation in order to start our analysis and do our analysis, so we're running similar code over and over again. The organization of this is really helpful, to have it all in that one project, and then you're getting to see them pop up like that. That's great. Definitely going to steal it.

Good, I'm glad. That's exactly the use case. It really is all about organization. It's about getting the same results with all of your collaborators.

Right. Yeah, thank you.

Sure thing.

Hi, John. It's Dawn here.

Hi, Dawn.

I just had a real quick question. I did write a question in the chat.
So I'll just... oh. Okay, I work with microbiome data, and it's compositional. So at some point do I need to change my feature table or BIOM file or metadata files into a matrix?

Ah. I don't think I can answer this question. Well, I don't want to say definitively that I can't answer the question. I don't think I can answer it right now without seeing what you're doing and understanding it, because your area of expertise is not mine. I'm much more involved in social science, and I don't totally get what you're talking about. But I would be happy to have a consult with you. You could show me what you're doing, we could share screens on Zoom, and I would certainly give you my first impression. My off-the-cuff answer is that, for myself, I tend never to use matrices. I like the convenience and the metaphorical simplicity of a data frame. But sometimes you have to do matrix algebra and you have to use a matrix, so it depends on your context. Data frames, which in a tidyverse context can also be called tibbles, are super convenient, but the people who developed them are the first to say a data frame is not the right format for everything. It is a great way to organize a lot of things. What makes matters more complicated is that in R you can have data frames, or you can have vectors, or you can have lists, and the lists always manage to boggle my mind, no matter how many times I look at them.
I go, "I don't know what this thing is." And what's really crazy about it is that a lot of things in R are actually lists under the hood. A data frame is technically a specialized R list. So for ease of my mental model of thinking about the data, I like to put it into a rectangular grid, as a data frame, as often as possible. But I would never say you must use a data frame, because it really depends on context, and you may run across packages where it's a requirement to have it as a matrix. And there are some convenience functions: you can use commands like as_tibble(), as.data.frame(), and as.matrix() to shift them back and forth. It's usually non-destructive. So you might want to shift it into a data frame to look at it, but into a matrix to process it. Anyway, long story short, feel free to set up a consultation with me and we'll take a closer look.

Okay. Thank you.

Yeah. All right, folks, I probably should turn the meeting off, but if anybody has one last question, let's go ahead and answer that. If not, I couldn't thank you enough for your time and attention today. All right. Take care.
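
As a footnote to the last answer, shifting between a data frame, a tibble, and a matrix might look something like this. It is only a sketch with made-up numbers; it says nothing about how microbiome feature tables in particular should be handled.

```r
library(tibble)

# A tiny all-numeric data frame (made-up counts, purely for illustration).
df <- data.frame(sample_a = c(10, 0, 5), sample_b = c(3, 7, 0))

# A data frame is a list of equal-length columns.
is.list(df)

# Shift it into a matrix when matrix algebra is needed.
m <- as.matrix(df)
cross <- t(m) %*% m      # matrix multiplication needs a matrix, not a data frame

# And shift it back to a tibble or a plain data frame to look at it.
tb  <- as_tibble(m)
df2 <- as.data.frame(m)
```

The round trip is lossless here because every column is numeric. With mixed column types, as.matrix() has to coerce everything to a single type (often character), which is one case where the conversion is not non-destructive.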