And this is going to be a demonstration of how you can leverage R, really how you can extend your knowledge into different domains if you know the tidyverse. And there's one practical example of how you can use tidyverse tools plus a few other packages to start gathering Twitter data. There are slides in there. And just to show you where: if you download this, with the Download ZIP button after you click the green button, there's a directory in there called slides, and if you download and expand the file, you can open index.html within that directory. I like to start by reading this land acknowledgement, so if you'll give me your attention for just a moment. Duke University sits on the ancestral land of the Shakori, the Eno, and the Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out of the persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of our collective journeys. So that's a very serious thing. There are other very serious things that have gone on in the news today. We're not going to discuss those serious things going forward, but I hope that some of this information might help you fix some things or injustices that you see before you. In any case, today is a demonstration, and the goals of the demonstration are to gather some tweets. In order to gather some tweets, we're going to define what an API is and then talk about the Twitter developer portal, which is a particular API access point. There is a way to request academic use of the Twitter developer portal; we'll talk about that. We'll do some very rudimentary text analysis and visualization. I got one question from somebody specifically about visualizing geolocated tweets, and we will definitely talk about that. And I will point out some useful documentation along the way. The one thing that I like to make really clear is that a lot of what happens when you analyze tweets falls under the category of text analysis, and text analysis is a very dynamic and changing field. I am not a text analysis expert, and I don't pose as one. I'm just an R/Twitter/tidyverse enthusiast, and I'm trying to show you some practical tips on how you can get started. The reality is that there's a tremendous amount to learn when it comes to text analysis, and I will try to point out some books, some free books that you can get online, to help you get started. It's possible, if you've done text analysis before, that you know more than I do. So I just want to make that clear: my goal here is just to demonstrate some useful R code and give you a starting point where you can take your research further. In terms of a starting point, I like to point out that I do a lot of R workshops, introduction-to-the-tidyverse kinds of workshops. So I'm going to click on that link and go to Rfun, in case you haven't seen it before. This is a sub-branded site for my center that focuses just on R.
If I start to cover things that you don't recognize or need a little more background on, you can possibly get them from here. If you scroll down, there's a section called R workshops, and you'll see different modules on lots of things. For example, we'll talk a little bit about mapping as we map Twitter geolocation data; there's more information there. We're going to do some visualization; more information there. This quickstart guide is really helpful. What I did in the last year is I flipped my introduction so that it would be more convenient and useful in a COVID environment where none of us are meeting face to face. There's maybe a 20-minute video right there, the quickstart. There's maybe a 15-minute video right there on dplyr, which is all about reshaping data, getting it into a format where you can then analyze it. This is going to be really useful: as has been pointed out many times, and documented at some point in a New York Times article, usually about 80% of any good data project is just the normalization, cleaning, and reshaping of data. That's what you would use dplyr for. There's a video here on ggplot2. I want to bring to your attention this section right here in the middle: lots of things to learn more about in terms of assignments and pipes, joining and pivoting data, and R Markdown. There's a nice link right here to a playlist of all these short videos, and you can play them at any time. In addition to that, I could give you even more pointers to other useful learning materials for the tidyverse. One that I'll just mention: you can go to your favorite search engine and put in the phrase RStudio Primers. Send me an email if you need the actual link, but I think you'll find it. That's a really nice interactive place to learn and build up your R and tidyverse skills. Going forward, I'm going to start by talking about application programming interfaces, or APIs. APIs have existed for a long, long time; they existed before the web was a thing. An API is essentially a way for a machine to talk to another machine to do a certain thing. I bring this up because, if any of you were in my workshop a couple of weeks back where I talked about web scraping, which is really a form of screen scraping, a lot of people will come to me and say, well, I want to scrape some site, for example Twitter, because it has a web front end and you can browse it through your web browser and use your mouse to scroll around. That is technically something that you could possibly do, but most sites don't actually want you to use a computer-based robot to scrape the web front end of their site. They want that to be the front door, responsive to people who are clicking and browsing manually. As a result, there are other ways to get at the same data. There are literally thousands of APIs, and there are a couple of clearinghouse sites that will list them, but you can usually just use a search engine. For example, if you wanted to find an API at the CDC site, you could search something like CDC, COVID data, API; maybe financial data; anything like that. If a site has a web front end, I should say a big data-source web front end, there is a good possibility that there's also an API for that site.
You can think of, or I often do think of, the API as kind of like the loading dock or the back door entrance, right, where the web page is the front door. They want the front door to be pretty; they want everybody to walk through it in the right kind of manner. The API is really for bulk collection of data. In a way, you do sort of the same stuff as web scraping, if you were in that previous workshop where I talked about that: first you have to gather the data, then you have to figure out how to traverse through the different options of the data to pull it back, and then you have to parse it to separate the markup language from the text. Or sometimes you want elements or attributes of the markup language. And a lot of times you're going to end up doing the same thing with an API. Most APIs these days will ask you, first of all, to register with them. Almost all of them require you to register. They don't all cost money; some of them do, but the registration process is often free. Documentation exists there, and every API is going to be different, so you're going to have to look at that documentation and search for forums and places where people can give you help. When you register, they'll give you your personal set of access keys. One of the keys you need to keep secret, so keep that in mind: if you're putting it in your code and you're putting your code up on GitHub, you want to make sure that you're not exposing your personal API key. The other key is more public. So that's something to keep in mind. Twitter is no different, right? They have a developer portal, and you can apply for access to it. Today you won't actually need to apply to the developer portal; if we get far enough along and we start using code, just your standard Twitter account should work. Twitter's rules about accessing their API through the developer portal actually change pretty rapidly, and this is a relatively recent change. Some of those rules are quite inconvenient. For example, up until recently, you couldn't get very much historic data without paying some kind of fee. Very recently, in the last couple of months, they opened up a new sort of policy channel to their portal that they call Academic Access. You can create a Twitter account, and you don't necessarily have to use your main Twitter account if you're a regular Twitter user, and you can apply for that academic access. It's kind of exacting. It's not going to happen in two minutes or three minutes or five minutes; it can take a week. You have to fill out certain information. Just be very clear about what you're intending to do: I'm collecting data for this research purpose, I'm going to publish it or not going to publish it, whatever. And then keep close track of your email, because that's the way they're going to respond, and eventually they will give you access. So let me just go to that link so you can see it. There's the button that you would click to apply for access. Now, you can see that I'm already logged into my Twitter account, which, by the way, I really don't use that much, but I use it enough that it will be useful for today. This is my view of the developer portal, and you will notice that I have a couple of what are called apps. I'm not going to open them up, because some of these probably have my keys in them, and I don't want to open them up, expose that to the video, and then have to go and edit that back out.
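Since I just mentioned keeping your secret key out of code that you push to GitHub, here's one common pattern for doing that in R. This is a minimal sketch, not anything rtweet-specific, and the environment variable name is hypothetical:

```r
# In a file called ~/.Renviron (which you never commit to git), put a line like:
#   TWITTER_API_SECRET=your-secret-key-here
# R reads .Renviron at startup, so your script can fetch the key
# without the key ever appearing in the script itself:
api_secret <- Sys.getenv("TWITTER_API_SECRET")
```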
But that's the information that you may need if you have to go beyond the standard thresholds that Twitter is going to impose on you. And I also wanted to point out, let's see, I think it's right here. If we click on Twitter API, or products, maybe products, let me go back to my slide. Oh, it's right here; it just didn't appear at the top. You can see that I have standard developer portal access that I'm using, and I've also got this academic research track, which I think I already applied for, but then there's this button here that says get started. When you click on that, you'll have the ability to fill out the forms to get yourself approved, and then you'll be able to do much more historic data research; your thresholds will be quite different. So I encourage you to do that. If you're just exploring for today, you're not going to need to do those things, and if you're intending to turn this into a research project, it will take a little bit of time. All right. So going back to APIs: again, the Twitter developer portal is a Twitter-specific API. A lot of APIs return their data in what's called a JSON format. JSON technically stands for JavaScript Object Notation, but you don't have to know JavaScript in order to use it. What it is is embedded structure in the data that's returned, in a key-value-pair kind of format. So once you get this data back, if I wanted the first named value, John, I would need to parse through my data record, which starts with this open curly bracket and ends with this closed curly bracket, and index on the key, which in this case is firstName. If I wanted age, the value of which is 25, I would have to index on age. Now, there's a package here called jsonlite. If you find that you're working with some JSON data from some other API, you may need to use jsonlite so that you can parse the JSON data effectively. Basically, the reason why sites do this is because they can return large amounts of data to you with the data schema that's relevant to the record, and it's not quite as rigid as, for example, a relational database, where every table has to have the same fields. In a JSON file, if we take this example right here, there's a phoneNumber key which is itself an array that has different types of phone numbers: home, office, and mobile. I may have no phone numbers, in which case, technically, I wouldn't have to return that part at all for one record; or I may have only one phone number, and then I don't have to return an empty string where there's no home number, et cetera. Anyway, JSON is just a very flexible data schema structure for telling you what kind of data you're getting back. Now, fortunately, we're talking about Twitter here, so we're going to use this rtweet package. And one of the things that rtweet will do for us is turn the JSON data into a data frame, so we can largely ignore this for today. But if you're accessing other APIs, that will be useful information. All right, so rtweet is developed by this group called rOpenSci, and they keep it as up to date as possible. You can go to this link, and that will give you access to the documentation on all the functions. Two particular articles are useful at that site; there's a submenu that I think has the option of articles. One is on obtaining and using access tokens. I talked about those tokens and keys back here; you might want to double-check that.
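Going back to the JSON example for a second, here's a minimal sketch of parsing a record like the one on the slide with jsonlite. The JSON text itself is illustrative, reconstructed from the firstName/age/phoneNumber example I just described:

```r
library(jsonlite)

# A record like the slide example: key-value pairs plus a nested array
json_txt <- '{
  "firstName": "John",
  "age": 25,
  "phoneNumber": [
    {"type": "home",   "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ]
}'

person <- fromJSON(json_txt)
person$firstName    # index on the key to get "John"
person$age          # index on the key to get 25
person$phoneNumber  # the nested array comes back as a small data frame
```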
And the other one is this very useful article, an introduction to rtweet and collecting Twitter data. We're basically going to run through that today. As I mentioned before, once you get the data back, you're going to want to do some kind of text analysis. Text analysis is a rapidly changing field, but the package that we'll use most today is a package called tidytext, developed by Julia Silge and David Robinson. There's a free book called Text Mining with R that you can get to on the web, which I find very helpful. It's concise and specific, and it has some case studies, in particular one on comparing Twitter archives that I would recommend to you. Another thing that I would recommend is this link right here. SICSS stands for Summer Institute for Computational Social Science. It is an institute that is the brainchild of Professor Chris Bail, who is in the Sociology Department at Duke, and another professor up at Princeton, Matthew; I'm sorry, I don't remember exactly how to pronounce his last name. If you go there, under the curriculum section they have several videos on text analysis. So beyond what I tell you today, they're going to give you more of the theoretical background on text analysis and text mining. That should give you a really solid start moving forward. I'll try to loosely introduce the topic as it relates to tweets, recognizing that text analysis is a field that has been around for a while and has more typically dealt with documents, anything from books and novels to text coming out of NGOs and things like that. Tweets are a whole other weird application of text analysis, because the language tends to be more informal: there are a lot more abbreviations, there are hashtags, all kinds of things that make text analysis on Twitter unique and challenging. That's where I would check out this case study, and maybe some of this curriculum, to get a better handle on some of the nuances. Having said that, what I would like to invite you to do, if you want to code along with me, is go to this GitHub site and download the code. And I'm just going to go straight into my RStudio. Okay, so what you should be seeing now is my blue-themed RStudio, open to that repository that I showed you. And I'm going to start here in this file called 01GatherTweets.Rmd. Let me do one or two things just to make it easier for you to see: change the size of the appearance of the interface. And I don't really need this right-hand part of RStudio, so I'm just going to expand this Rmd document to full size. There's some text and other links in here that you can read through. The two packages that we need at the moment for gathering data are tidyverse and rtweet, so I'll load those. I have the warnings turned off, so it didn't actually respond to me. And I see Denny's here; Denny may have seen some of this before, because I did a presentation very much like this one, which he organized. In that one, I had recently become aware, this was just six months ago, albeit behind the times, of this K-pop group called BTS. So I wanted to see what I could learn about them. And many of you will know the group, as well as the song Dynamite.
So what I'm doing here is just searching on that hashtag, limiting the search to a thousand tweets, and using the flag that says I don't want any retweets, right? I want only original tweets. And when I run that, let's see what happens. Okay, so the first thing it did is it popped me into the web browser and told me that it was authenticating and that authentication was complete, so I can close that. You probably didn't see all that because of the way Zoom is working. But if you hadn't been logged into Twitter, you would probably have to log into Twitter. And then you get a progress bar on what you're returning. So now I have an object in my environment space called bts_dynamite. And like I said, because we're using the rtweet package, we don't actually have to look at JSON data; we get this data back in a data frame. So if I display this data frame, I have that thousand rows. You'll also notice it's quite extensive: 90 columns. It's quite a bit. It consists of some user IDs and some status IDs, and you can scroll to the right, but just for the heck of it, rather than scroll right through 90 columns, I'll use the glimpse command. Just in case anybody didn't pick up what I just did there: I had removed the r designation from the opening code chunk so it wouldn't execute when I rendered the document, and I just hit backspace to make it an active code chunk again, so I could execute it. And you can scroll through and see that for a great many of these variables, there is not any data. But there are quite a number of variables, including, for example, a thing called bbox_coords, which holds coordinates that relate to geographic designations. If people have recorded geocoordinates, there might be coordinates there; if they haven't, there will not be. There's all kinds of information here; it's almost mind-boggling. But a lot of the time, when people talk about analyzing tweets, of course, they mean getting to the tweet itself, which exists in a variable named text. Now, I can't read that one. My friend from Brazil who's online here, maybe he can, although he speaks Portuguese and I don't know what language that is. But in any case, that's not important. This is common: you will get all kinds of information back, and it could be in multiple languages, especially in a case like this. And you will occasionally get things like what you see right here, which are UTF codes, I'm pretty certain. And this is a big stumbling block, not only with text analysis in general in R, but with tweets also: not everything comes back as text that's easy to read, and you'll have all kinds of codes that you have to clean up. You're going to spend a fair amount of time Googling and searching for ways to figure out what a given UTF code is and whether you want to retain it or just drop it. We're not going to worry about those things, but it's something you'll have to keep in mind, and it goes back to what I was saying: 80% of any data project is the data cleaning, and there's plenty of data cleaning when it comes to Twitter data. So if you want to code along with me, I would suggest you put your cursor at the beginning of line 63, hit a backspace, put in any search you want in that blank line, and gather some tweets just to see how it works.
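For anyone coding along, here's a minimal sketch of the kind of search I just ran. It assumes the same hashtag and object name from my narration; swap in whatever query you put on your blank line:

```r
library(tidyverse)
library(rtweet)

# Search recent tweets on the hashtag, capped at 1,000, original tweets only
bts_dynamite <- search_tweets("#Dynamite", n = 1000, include_rts = FALSE)

dim(bts_dynamite)      # roughly 1000 rows by ~90 columns
glimpse(bts_dynamite)  # scan all the columns without scrolling sideways
```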
In the meantime, I'm going to read Brenda's question. She said: I once spent several days making a UTF-code-to-emoji library. If you have tips or resources on that, please do share. Yeah. I don't have any doubt that you spent quite a bit of time on it. I don't know that I have any tips, to be honest with you, but if you run into a really specific problem like this and you haven't solved it, I'm happy to help try and point you in the right direction. I have some ideas of how UTF works, and I may know a little bit more than the average person, but that particular problem, off the top of my head, I don't know the answer to. All right. So there we have it: we've gathered some data. I put mine in an object called bts_dynamite; if you put in your own search, you'll have your own object of gathered tweets. And oh, this is a point worth mentioning if you're trying to collect data; well, this is a topic that has changed. I mentioned that the policy rules for accessing Twitter keep changing, and recently they changed them and gave a lot more permissions to the academic portal status that you have to apply for. But prior to that, if you were just a standard person and you wanted to do a historical analysis of some tweets, you really had only two options. One was to pay for historical access. And what was really annoying about that is you couldn't find a single site that would quote you a price. You had to propose what historical search you wanted, what the boundaries of the timeline were, and then see whether they were going to charge you, or maybe give it to you free because you're an academic, or what. There was no straightforward way. It felt a little uncomfortable, like you weren't being treated completely fairly; on the other hand, you could sometimes get the data for free, and who could argue with that. Now that they have the academic status, in theory you can do more historical tweets. The other way that you could do it was to anticipate what you were doing. So, for example, if you're going to do some analysis on the 2022 elections in the United States, you could start collecting your data now. In order to collect your data now, one of the things you're going to have to do is run that search, this function right here, on some serial basis. So you're going to have to have some computer that runs either in the cloud or off on the sidelines, always running. And here are two mechanisms: there's probably a Mac mechanism where you can schedule your R scripts to run on a regular basis in the background, and there's also a tool that I recently learned of called Apache Airflow. I don't know how it works. But all of those are things you're going to need to keep in mind. If you're gathering data going forward and you want to put it in a database or some kind of data store, you need to run on a regular basis; you're going to have to run something like this with some kind of scheduling task. There are a couple of RStudio instances up in the cloud that might make that even a little bit easier. But in any case, keep that in mind. All right. So, a further tour of the kinds of things that rtweet allows you to do. This is my Twitter handle right here. Again, I don't use it a whole lot, but I can use this get_friends function, and you'll see what it returns. This is going to tell me, I think, who I'm following. And if I run that, this is what I get back: a two-dimensional table that consists of the user, which is me, and a bunch of user IDs. And that's it. So then you have to do what's called rehydrating.
I have to take this user ID, which is unique to somebody that I'm following, and kind of hydrate back the information on who that person is. One reason for the IDs is that people can change their Twitter handle names, but from a database point of view, you wouldn't want that to break things. Just because I changed my name, for, I don't know, performance reasons, that doesn't mean that my historical tweet presence should be lost. So that's why they assign an ID. Anyway, there's a very useful function that can deal with that: I can send this whole vector of user IDs to a function called lookup_users, which I'll do right now. And when I run that, I get back, in this case, some more information; again, 90 columns of information about each user. Conveniently, and I don't think I did this on purpose, I'm not sure how lucky I am that it came back this way, but rOpenSci is at the top, which is the Twitter handle for rOpenSci, the group behind the rtweet package that I'm using right now. So that's kind of fun. And then a bunch of other people that I happen to follow on Twitter. One of the things that you get in this particular data frame is the last tweet that each of them sent. So not their whole timeline, but one representative recent tweet for each person who I looked up. And what could be included in there, just to go back to, sorry, I forget who asked, is geographic information. Let's do a glimpse on this and see whether we can find any people who have geographic information. I don't see any right off the bat. There also might be geographic information in their profiles, but in any case, we'll get to that. All right, so that was rehydrating more information on users. And then you can do the same thing with followers, so, I think this is right, who's following me. Yep. And by the way, there are sometimes permission issues that you'll run into: you can't necessarily get all of this information for other people. Some of it you can, some of it you can't. But if I run that and find out who's following me, and then I look up the users on those user IDs, then I can display that data. And here's a bunch of people who have, for whatever reason, decided they wanted to follow me. This should be the last tweet each of them sent, full of all kinds of funky stuff. Like, of course, that's a Twitter handle. And that backslash n, if you haven't seen it before, stands for newline, which is essentially a carriage return or an enter key. And then other things. All of it is the same 90 columns. All right. So one thing that's really interesting is get_timeline: you can get timeline information on any one individual. There is an individual named Rhiannon Giddens. I don't know if any of you know of her, but she's a pretty famous musician who, I believe, went to high school in Greensboro, North Carolina. So I'm going to get a timeline on her and limit the timeline to 3,200 tweets. And let's see, that might take a while to run; I'm pretty certain I've got a progress bar in the console. Hey, John, what does 3200 stand for again? Yeah, that's just limiting how many tweets it's going to return. Okay. So you can change that number. And as long as you asked that question, let me zoom back out.
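Here's a minimal sketch of that friends-and-followers rehydration pattern, plus the timeline pull. The first two handles are placeholders for your own account, and the Giddens handle is my assumption of her screen name, so check it before running:

```r
# Who an account follows comes back as bare user IDs...
following <- get_friends("your_handle")        # placeholder handle

# ...so "rehydrate" those IDs back into full user records
following_info <- lookup_users(following$user_id)

# The same pattern works for followers
followers <- get_followers("your_handle")      # placeholder handle
follower_info <- lookup_users(followers$user_id)

# And a timeline pull, capped at the API's roughly 3,200-tweet limit
timeline <- get_timeline("RhiannonGiddens", n = 3200)  # handle assumed
```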
Let me just remind folks that if you're not sure what arguments a function takes or how you can use it, you can always highlight that function, get_timeline here, and hit F1, and that kind of information should then appear in the documentation. So if I scroll down here: n equals the number of tweets to return for the timeline. And then, of course, one other way would be to go directly to the rtweet website, where they have that same information online. So I got that timeline back, and I can look at it. It should look pretty familiar to you at this point, 90 columns, but in this case 3,196 rows returned, which is pretty close to 3,200, right? Four of them are missing for whatever reason. Then, since it's referred to as a timeline, what I'd like to find out right now is the range over which Rhiannon Giddens has been tweeting. So I'm just going to take the minimum value of the created_at variable and the maximum value, which are both dates. There's created_at right there. That's tidyverse motion there, right? That's dplyr, not rtweet. And what I can read from that is that Rhiannon Giddens has been tweeting from late October 2015 right up until today, as a matter of fact: 16:47, and I'm not sure what time zone. Actually, I think she lives in Scotland at the moment, but I'm not certain about that. All right, so there's my range. That's just for my own personal information; I was just curious. So let's visualize that, right? Let's visualize her timeline. This thing right here called ts_plot, I just want to double-check, yeah, is an rtweet function that stands for time-series plot. And it actually returns a ggplot visualization, which you can see right there. It did everything for me. And since it happens to be a ggplot visualization, I can then follow it with a whole bunch of ggplot functions to make it look however I want. So that's a really handy thing. That's the exact same graph, just tweaked to be more to my liking. And I'm going to change one more thing, because I want the legend to be, let's see, not at the bottom but at the top. So that's a really nice feature: not only can you easily visualize the timeline, but if you've learned ggplot, you can easily modify that visualization. Let's see. All right, so another thing that I can do is get_favorites, because people on Twitter can favorite a tweet. So I'll use the Rhiannon Giddens handle, and if there are 3,000 things that she's favorited, I'll pull all those back. And I bet there's not 3,000. Turns out there are 412. And again, very similar to what we've seen before, though I'm not sure what the one extra column is in this case. Maybe it's the favorites, and it might be at the end; I'm curious about that right now. So I'm just going to run this glimpse and see if I can see something. Yeah: favorited_by, tacked onto the end there, favorited by Rhiannon Giddens, which is not actually particularly useful information in this one case. But if I was going to combine it with something else, I would want to know who's favoriting what. So there are Rhiannon Giddens' favorites.
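Before moving on, here's a sketch of the two timeline steps from a moment ago, the date range and the time-series plot. It assumes the timeline object from the earlier chunk; the by = "months" interval and the labels are my own choices, not necessarily what was on screen:

```r
# Range of the timeline: plain dplyr/base R, nothing rtweet-specific
range(timeline$created_at)

# ts_plot returns a ggplot object...
ts_plot(timeline, by = "months") +
  # ...so ordinary ggplot2 layers can restyle it
  labs(title = "Tweets per month", x = NULL, y = "Tweet count") +
  theme_minimal() +
  theme(legend.position = "top")   # move the legend to the top
```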
Another thing you can do is search user profiles; there's this search_users function, and what I'm going to search here is the phrase Gullah. Gullah, and I might not get this definition exactly right because I don't study languages, is a pidgin or creole form of English; it's not strictly English, and it's spoken in the coastal communities of the southeastern part of North America, specifically by enslaved peoples and their descendants, who didn't give up their language and turned it into something unique. It's called Gullah. So I'm going to search for anybody who's got Gullah in their profile, and once again limit the return. You don't have to put in the n = 1000; I'm just doing that to make sure these searches don't run on forever. So now I have that in an object named gullah. And if I return that, I get 500 rows where these particular screen names have Gullah somewhere in their profile. If I knew where that was in this data frame, it might actually be a list column, and we could probably find it, but it's not all that important; it's going to be one of these variables. And it's not necessarily going to be in their most recent tweet, right? It's not there; it's just in their Twitter profile. Another thing you can do is search on trends. That one takes a location, so at that point I was just curious whether or not we could get the trends for Greensboro. I myself have not found this particular feature all that handy, and I'm not even certain that I'm using it right, to be honest with you, but that's what it's supposed to do; I might need to double-check the documentation on that. But now let's talk about location information, going back to what somebody had asked. In order to use location information, I'm going to bring in a whole other package, this package called tidygeocoder. There's a link there to the tidygeocoder package, and there's documentation there. But basically, I'm going to start with this Twitter data, Rhiannon Giddens' timeline, right? If I display that. And then I'm going to drop any NAs in place names, select just certain variables, and then take only the distinct rows where that exists. I get back this list of 16 rows where Rhiannon Giddens specifically named some place name, probably some place where she's performed; I actually haven't gone back to look at the tweets. Then what I did is I created a new variable called address, which should be over here; oh, I haven't run that yet. It's specifically formatted so I can send the phrase to the tidygeocoder library. And when I run that whole thing, it's going to go out, look at each one of those place names, and return for me the latitude and longitude of the place that was identified. It can sometimes take a little bit of time, because it's basically orchestrating a whole other API at a whole other site; it's specifically using the OpenStreetMap geocoding database. And it returns these last two columns for me that I didn't have before: latitude and longitude. Once I have latitude and longitude, I can then map those locations.
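Here's a minimal sketch of that geocoding pipeline, assuming the rtweet timeline object and its place_full_name column; the exact columns selected in the workshop script may differ:

```r
library(tidygeocoder)

# Distinct, non-missing place names from the timeline, formatted as addresses
places <- timeline %>%
  drop_na(place_full_name) %>%
  distinct(place_full_name) %>%
  mutate(address = place_full_name)

# Send each address to the OpenStreetMap (Nominatim) geocoder;
# this adds latitude and longitude columns to the data frame
places_geo <- places %>%
  geocode(address = address, method = "osm",
          lat = latitude, long = longitude)
```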
Now, there's location information in lots of places: there's a place name, there's a full place name, there's a city place type or POI, which I'm not sure about, point of interest maybe, country, country code. What's the best thing to gather? And then there's also potentially information in a person's profile. The question was, how do you assess that? And I honestly don't know. I think what you're going to find, what I have found when I have looked into this, is that, first of all, there's a whole lot of missing information, because not everybody wants Twitter to report where they're tweeting from. And the other thing is, there's very little chance that you can verify that whatever place name they're identifying is real; it could just be some kind of Twitter ruse. So I'm not sure how you identify what's best. But once you get that information and you send it to tidygeocoder, you get back the latitude and longitude, and once you get back the latitude and longitude, you can visualize it. In this case, I'm visualizing it with ggplot, with some code that I'm pretty certain I got off of the rtweet website. All it's doing is identifying these places that Rhiannon Giddens mentioned, and I'm pretty certain she mentioned them because she performed there. She's a very accomplished vocalist, an opera singer I think, among other things. And I thought it was kind of interesting that Greensboro, North Carolina would show up in this list of other places that are probably better known for their performance halls. In any case, here's another example. That was Rhiannon Giddens' timeline and the places she mentioned, but I did the same thing for that Gullah set of tweets. The same exact procedure, just to see, for people who had identified Gullah in their profiles, whether we could visualize where they are. I got back very few responses, some duplicate information, and some stuff that I couldn't easily look up a geocoded location for. But if I visualize those using another mapping technique, in this case, I get Queens, New York; Atlanta, Georgia; and Los Angeles. And I probably could change an argument to get a few more displayed. At this point, I just want to mention that I know I've done some rather quick visualizations and gotten into some mapping visualizations that you may not have seen or followed. So I just want to go back here to Rfun, if I can, and point out, where did Rfun go? There it is. There is at least one workshop on mapping right here. And not only that, but if I go all the way down to the bottom, I'm going to go to my center, Data and Visualization Sciences. Why did that not happen? Oh, I'm going to click on this globe and click on online learning, because you might find the R mapping video that's here more useful. I'm not sure which one I was referring to on the Rfun site, but this one is definitely done by Drew Keener, and he's a GIS specialist, whereas I am not. So he might be able to cover some of the details of how you do mapping and geospatial analysis in R better than I could. So I should pause here and make sure that, if I covered something too fast, you have an opportunity to ask a question; feel free to either unmute or put it in chat. And I really am going to pause; well, I'm going to keep on talking unless you say something out loud, in which case I'll stop, but if nothing comes in, I'm going to get into text analysis of these tweets. I have a question: being a relatively new R user, what's the reason for using R Markdown versus an R script? Is it a preference, or is it easier for Twitter? It's a preference. When I teach R, I always teach with R Markdown, but you can generate scripts in many ways: you can do a plain R script, you can do an R Notebook. R Notebooks and R Markdown are essentially the same thing.
And they're both examples of what's called literate coding, where you can integrate your prose or natural language with your code chunks. I wasn't intending to, but I'll inadvertently demonstrate, because the next script I'm going to run through is this one called Analyze with Tidytext; this is that script right here, the Rmd file that starts with 02. When you have an R Markdown script and you run the whole thing by clicking on preview, you can generate different kinds of outputs: could be slides, could be websites, could be web pages, could be PDF documents; the list goes on. And that's useful, because I may want to use my same analysis but get a different kind of output, and I don't want to have to do a whole lot of copying and pasting or retyping, because all of those things create opportunities for typos and failures. As I said, this line right here decides how it's going to be rendered when you run the script. In this case, it's an HTML notebook. So right here is a file with basically the exact same name, but it ends in .nb.html, for notebook HTML. It's a self-contained HTML document that you could send to somebody, like you would send a PDF or a Word file; Microsoft Word is another output option. And here you can see how the natural language is rendered in a more elegant way. I haven't actually got a whole lot of natural language in this document, but I've got some. That's why I do it this way: it enables me to keep my analysis right in the same file with my prose. Then, if my data changes, I can change my prose to reflect how I want to explain that analysis change, or if my analysis changes. If I keep it all in one place, it's more likely to be reproducible in the future. Hi, John, I had a question. Could you repeat what you said about the limitations that the Twitter API puts on retrieving tweets? If I put in a hashtag that's very popular and I set n equal to a thousand, how does it choose to sample those tweets? Yeah, I actually don't know the answer to that question, Brenda. It's a great question; I'm glad you asked it. I've just never looked into exactly what it is. You are definitely correct that it is a sample; I wish I had mentioned that earlier. And as I have understood it in the past, particularly when you didn't have the academic access, there was no way to guarantee that that sample would be the same sample if you ran it again. For example, anytime you run a statistical analysis where you're setting a random number to use as a seed, you can record that random number and get the same results. But as far as I know, you cannot do that with Twitter. So you just have to collect the data that you can, and you keep that data around with the IDs, and then you rehydrate them as necessary. If you rehydrate the IDs, you'll get the same data, but as far as getting the same tweet population, I don't know that you can guarantee that. I don't think that you can, but I want to recognize my own limitation here as not being a Twitter scholar; if that's possible, I'm just unaware. The other thing to point out again is that if you apply for that academic access, you're going to have the ability to get more data. And by having more data, maybe you can, well, I would have to look into it, and I'm sure you would, too, but you could at least set up some barriers to ensure that that consistency exists.
And for what it's worth, I looked at the academic link that you put in, which is the correct link, but it's not loading on Twitter. You click apply, and from Twitter, it doesn't go anywhere. So I don't know if that's it; I'll check later and see if they fix it. I'm glad you said that. It makes me initially wonder if you have a developer account, because it's sort of almost like three steps, to my understanding: one, you get a Twitter account; two, you get a developer account, which is free; and then three, you apply for your developer account to be academic. So that may explain why it's not working, if you don't have the developer account yet. That's helpful. Thank you. We'll run through this, an example of text mining. So this document exists in that repository that I shared with you, and as an nb.html file, you can load it in your web browser and read it. At the top is a link to that Text Mining with R book that I mentioned, which is online and free. You can scroll through it. There's a really important section right here on analyzing word and document frequencies, and all of these chapters are relatively brief and concise. N-grams are an important concept; sentiment analysis is an important concept. The book gives you practice data. If you're interested, the practice data is all of the Jane Austen novels, which are just amazing novels if you haven't read them. And if you have read them, it's even more fun to go through the process, because basically you get to quantify things that you already know about these very popular novels. I also want to point out that there's an important section here on topic modeling. All of these are just introductions; they're very complex topics in general. But the Chapter 7 case study on comparing Twitter archives is very useful. Now, I'm going to run through this as the rendered report rather than live coding, because the Twitter data changes constantly and there are some visualizations that I want you to see just as demonstrations. If I gathered the Twitter data live in real time, it might take longer, and I couldn't necessarily guarantee that I was going to be able to show you what I want you to see. So in this case, we're just going to read the report. In any case, the same two libraries are vitally important, tidyverse and rtweet, and then tidytext for text mining. And we're going to make a word cloud. I have a caveat about word clouds: some people hate them, a lot of people hate them, some people love them. I'm going to show them to you because they're easy to do and visually interesting anyway. But the first thing I did is I searched for some tweets on a particular topic. There is an online game that my son and his girlfriend showed me that you can play on your phone, called Among Us. I don't know if you've played it; it's kind of fun, a little bit like Wink. And the reason why I picked it is because, when you're searching Twitter, you can get all kinds of garbage back, and I didn't want to put that on the screen if I could help it. But I found a hashtag called Among Us Art, and it's a fun little style of art that's sort of hand-drawn. So that was the hashtag I searched. It's not displayed here, because I didn't display the code for that. And I got back the same 90 columns that you're expecting; in this case, a thousand rows and several users, all tweeting on Among Us Art.
I do have a question: I don't know how to access this HTML doc. I'm trying to click the link, and it's giving me what looks like random code, and I'm not sure what to do. So here's what you do. If you downloaded from GitHub, which is this repository: download and unzip, and make sure you do both. Then, once you unzip, open up the slides directory and double-click on the index.html. Unless you unzip it locally, it won't actually display properly from GitHub, right? Oops, sorry, that's not the right one; I said that wrong. It's not the slides directory for this file, but it's the same issue. If you open it here and view raw, it's just going to show you raw HTML and a bunch of embedded stuff that you don't really care about. But if you download it, it'll display just fine in your web browser. All right, I got it. Okay, great. So we are here. I did a search on Among Us Art, A-R-T. I first composed this code maybe four or five months ago, and I didn't completely clean it up, but it still demonstrates the same things; it exposes the same issues. So once I got my data back, one of the things I did is I filtered on is_retweet and said FALSE; of course, I could have done that in the search that was up here. The other thing I did is, in the text and the hashtags variables, I removed some terms. Originally, before I did this for this week, I was doing some searches on election stuff, and there was a whole lot of just spam coming back. I found these two hashtags that were especially spammy, and so, using these dplyr techniques, I made sure that I could eliminate any text or hashtag values that had those particular terms. Now, fortunately, the Among Us Art search returns some really clean, wholesome, great stuff, and these filters have no effect here, because those terms don't exist in it. I didn't find anything that I needed to eliminate anyway, though I didn't look all that closely. But now you have a technique, right? If there's stuff you want to get rid of, you can do something like this using, in this case, the stringr function called str_detect. And if you precede that with the bang, the exclamation point, that negates the match: filter where the term is not detected, which is what eliminates the spammy stuff. Anyway, that's dplyr and stringr stuff, not Twitter stuff, so if it doesn't make sense, just reach out to me and I'll explain it better. And I got back this data frame, the same kind of thing we're used to seeing at this point, 90 columns; in this case, 225 rows. Then what I want to do is what's called tokenizing. This is part of the tidytext approach, where I'm just going to give a line number to every tweet. Then I count that up, and what I can see is that this person, this Twitter handle, tweeted 33 times, and this Twitter person tweeted 12 times, and so forth. That's going to be really handy for the text analysis going forward.
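Here's a minimal sketch of that cleaning-and-tokenizing sequence. The object names and the spam term are placeholders, not the exact ones from the workshop script:

```r
library(tidytext)

# Drop retweets and anything matching a known spam term
tweets_clean <- tweets_raw %>%                   # tweets_raw: your search result
  filter(is_retweet == FALSE) %>%
  filter(!str_detect(text, "spammy_hashtag"))    # bang negates: keep non-matches

# Tokenize: one row per word, tagged with the tweet it came from
tidy_tweets <- tweets_clean %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)

# How many tweets each screen name contributed
tweets_clean %>% count(screen_name, sort = TRUE)
```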
Then what I did, and here's where I was doing more of what I expected, more filtering, is I created a data frame called bad hashtags: things that I didn't want to keep. Because down here, I'm going to do an anti-join between my clean data and the collection of things that were bad. Again, it had no effect on this particular data frame, but these are techniques that you may want to repeat in the future. If I wasn't doing Among Us, that would have reduced my totals to an even shorter but cleaner corpus of data to analyze. And there's a little explanation, coming straight out of the Text Mining with R book, about how you can use various anti-joins, filters, and str_detect to do these things. Stop words are a big thing. I don't know if you're familiar with the term or not; it's a concept in natural language processing, text mining, and text analysis: words that are not particularly important, like all the articles in the English language, a, an, the, et cetera. And you can customize your stop word list. There are dictionaries of stop word lists by language, so you definitely want to use one for whatever language you're actually analyzing. And particularly since it's Twitter, there may be other terms that you want to add to your stop word dictionary. Once you do that, you can remove those stop words from your overall corpus, and what you get back is a listing of every word and who it's associated with. So in this case, a user called, I don't know how to pronounce it, srg2, on tweet four, tweeted the word among us. And for every tweet we have this index: there's tweet five right there, and we know that all of the words in this column, from among us down to this URL, belong to that one tweet. So we're keeping the context of how the words relate to which tweets, so that we can analyze them better later. Here's another example of stop words. I'll just point out that there are a couple of different ways to do stop words, particularly when you're doing tweets. It has to do with the fact that tweet data is just very messy: you can have hashtags in front of things, you can have these at symbols in front of things indicating a user ID, and there are things that you just want to take out of your text so you can analyze it better. So I have a couple of different examples of how to clean up your data. And then what I'm doing here is calculating the word frequency. The word frequency is: what's the username, what's the word, and then this nice little calculation, which showed up in the tidytext book; I didn't come up with this, and I don't think I ever would have. In any case, it's counting the number of times a word shows up for a person versus the total number of words that person has tweeted, because they may have tweeted more than once. Then, for every word, you get a frequency score, which allows you to plot it this way. Basically, the whole goal here is that we have a word, and then, it says up here spread, which is a function to turn long data into wider data; I use the pivot_wider function, which is the modern version of spread. So I took this data up here, and if we scroll to the right, here's the first user, here's the second user, here's the third user or Twitter handle, and every word, the first word is the first row, has a different word-frequency index, right? And that's what we're going to plot.
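A sketch of the stop-word removal and the per-user frequency calculation, assuming the tidy_tweets object from the tokenizing step. The frequency formula follows the pattern in the Text Mining with R book, but this is my paraphrase of it, not the workshop's exact code:

```r
# Built-in English stop word dictionary that ships with tidytext
data("stop_words")

tidy_tweets <- tidy_tweets %>%
  anti_join(stop_words, by = "word")

# Per-user frequencies: each word's count over that user's total words
word_freq <- tidy_tweets %>%
  count(screen_name, word, sort = TRUE) %>%
  group_by(screen_name) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

# Wide form, one column per user, for plotting one user against another
word_freq_wide <- word_freq %>%
  select(screen_name, word, freq) %>%
  pivot_wider(names_from = screen_name, values_from = freq)
```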
We're just going to plot two users, Sakurai Yamazaki, and I don't know if I'm pronouncing that right, and Hina Yamazaki. And the reason why I picked those two is because, with every row being a word, they appear to have several words in common, right? Unlike, if you look here at this srg2: srg2 has scores for these three words, these three rows, but this other Twitter user does not share those words. Between the two of them, they're really only sharing two words. So that's something you can play with: how to get your frequency table right, and which people you're going to choose to compare. The more you know about the population, the more likely you are to come up with the right kind of comparison. I know next to nothing about this population, so it was a little bit of a challenge. We're going to plot that. I'm going to skip all the way down to here, and there is an example of the first Twitter user and the second Twitter user and their index scores plotted on an XY axis as word frequency percentages. And I'll be perfectly honest with you: this is where I know, and it's very clear to me, that I am not a text analysis scholar, because I'm not exactly certain how to interpret that, other than to look at it. But again, I just want to expose you to ways that you can analyze tweets and compare users. If I scroll up, I'm going to show you some simpler ways to visualize things. I mentioned that some people hate word clouds, but if you wanted to make a word cloud, you could take those same tokenized tweets, count up the words, use that as a weighting mechanism, and then display your word cloud like this. I recently heard Professor Chris Bail say friends don't let friends create word clouds, which I thought was very cute, but it also represents the way many people hate word clouds, in sort of a similar way to how, if you study visualization, many people hate pie charts. I would take a slightly more agnostic view: there can be reasons to use these visualizations, and you should probably think clearly about why you might use them. My reasoning was that this was easy to generate, and it clearly shows you, at least among this population, after it's sorted out, which words were used most frequently. What it doesn't show you is the importance of those words and how they relate to one another. A way to show frequency of words that is, I think, more descriptive than this word cloud, but not as pretty, is to just do a frequency bar chart and display it that way. There you can get a sense of the relative importance of this Among Us Art hashtag compared to some of these others, which makes sense, because Among Us Art was the search that I actually did, and all the rest of these hashtags and words just showed up in the tweets as a result. One more example there, where I'm using the anti-join on stop words, but let's keep going.
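Before going on, here's a sketch of both of those quick visualizations, assuming the word counts from the tokenized, stop-word-filtered tweets; the wordcloud package and the cut-offs shown are my choices for illustration:

```r
library(wordcloud)

word_counts <- tidy_tweets %>%
  count(word, sort = TRUE)

# The word cloud: counts act as the weighting mechanism
with(word_counts, wordcloud(word, n, max.words = 100))

# The plainer but more readable alternative: a frequency bar chart
word_counts %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Word count", y = NULL)
```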
There's a section on word usage. I took my tweets and got a sense of the timeline range, the same thing I was doing with Rhiannon Giddens back in the other script. And what I'm doing here is creating word ratios, to find out how often each of these Twitter handle users uses a particular kind of word, so that I can compare them. In this case, it's just those same two Twitter handle users; I've subset the data in such a way that I'm just looking at those two users, and then created a ratio for each word that they're using, with a negative and positive score. Then, essentially, I can create a nice chart like this, where I can see that the Twitter handle user represented in a kind of pinkish red uses the word cool logarithmically way more often, while the Twitter handle user represented in blue uses the word cat way more often; that's their most distinctive word. Also overrepresented are these URLs, which tells me I should have cleaned my data a little more, because those probably were not all that useful. But it's one more way of looking at, analyzing, visualizing, and representing tweets. And that brings us to a couple of other sections, things that you could learn about. I didn't write out the code for these, but there are links that go straight to that book that I was mentioning. There are ways to understand and visualize favorites and retweets, and ways to understand and visualize changes in words. But I wanted to introduce you to this idea of a document-term matrix, or term-document matrix, and also inverse document frequency. These are common approaches in text analysis. Term frequency and inverse document frequency are basically ways to identify which words are important: weight words that are used rarely, and possibly devalue words that are used really frequently, like stop words, again, leading articles and things like that. Ultimately, you can adjust for how rarely a word is used. I lost my screen, I didn't mean to do that; there we go, hopefully you can still see it. And then you can visualize that. There's a link right there where you can read more about TF-IDF. Once you run through those kinds of screenings and calculations, you can end up with, here, a calculation on the frequency of the term Among Us Art by this one user. And then, what am I doing here: grouping by user and getting a summary of the total number of words used. Once I do that, I can combine those into one table using a join. And once I have both of those columns represented for each word, I can use this function, bind_tf_idf. It's not complicated to use, but it is complicated to understand and represent, and again, this gets into the scholarly aspect of text analysis. The tf_idf variable, the tf variable, and the idf variable are the ones I want to calculate, and I can do that with that one function. In the end, I can get a chart sort of like this, which shows, or should show, the words that are considered most important for these four users, in this case, and how they might relate to each other.
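Here's a minimal sketch of that TF-IDF calculation with tidytext's bind_tf_idf, assuming the per-user word counts built earlier. The explicit join mirrors my narration, though bind_tf_idf can actually compute its pieces from the counts alone:

```r
# Words per user...
tweet_words <- tidy_tweets %>%
  count(screen_name, word, sort = TRUE)

# ...and each user's total word count
total_words <- tweet_words %>%
  group_by(screen_name) %>%
  summarize(total = sum(n))

tweet_tf_idf <- tweet_words %>%
  left_join(total_words, by = "screen_name") %>%
  # adds tf, idf, and tf_idf columns: term, document, count
  bind_tf_idf(word, screen_name, n) %>%
  arrange(desc(tf_idf))
```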
So I have to tell you, I tell you all of this with a bit of sheepishness, because the more I study how I can make R do these things, the more I realize how little I understand about text analysis. It really is quite a rich field and quite a dynamic field, and I want to encourage you to go way beyond what I'm doing if you're going to use this in your scholarship. Since I don't publish articles, it's not necessarily important for me; my goal is to expose you to the resources. So check out that book, and check out the resources at Chris Bail's SICSS site, and you will be on your way to learning much more about text analysis. But we have about 10 minutes left, and I don't really have anything more to say, so I am happy to answer questions if anybody has any. I also want to encourage you, by the way, if you want to talk about how this relates to a specific research project beyond just surface-level questions, you might want to reach out to me for a consultation, where we can spend an hour really digging in on aspects of it. But in terms of general questions or clarifications, it's open mic, or throw something into the chat. In the meantime, I see that I have been ignoring chats that came in, so if anybody wants to ask a question, go ahead, and if not, I will try to read some of those questions. Okay, I see that Gul, and I'm not sure how to pronounce it, who has been here before, was giving some information on how to read raw HTML files that show up on GitHub. And Faheem wrote: not related to tweets, though; I see there is an argument in the pivot_wider function, values_fill equals zero. What does that mean? That's a great question, Faheem. What that means is that when you're pivoting your data from tall to wide, if there are empty values, you can assign them, or fill them, with something. I didn't do that here. But, I'm pretty certain, let's see if I can do this: I'm going to go in here to pivot_wider and hit F1. Oh, that didn't work; why did that not work? I'm going to go over here to Help and type pivot_wider. From this documentation, what can be helpful is that there is some practice data, fish_encounters and us_rent_income, and there may even be an example of, let's see if there's values_fill. Yeah, right there: values_fill, and there's an example right there in that code. You can try that out just to see how it works. In that bit of code, fish_encounters is a data set that will be available on your machine if you load the library it comes from, which is tidyr, I think; might be dplyr; no, that's tidyr. Then fish_encounters will exist, and you can run that sample code and try it out. Okay. So that's everything I've got for you, but I want to, again, make space in case you want to ask a question out loud.
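As a postscript to Faheem's values_fill question, here's the pattern from the tidyr documentation with its fish_encounters practice data; the scalar values_fill = 0 form works in recent tidyr versions:

```r
library(tidyr)

# fish_encounters: one row per fish per monitoring station where it was seen
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
# ...leaves NA where a fish was never seen at a station

fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen,
              values_fill = 0)
# ...fills those gaps with 0 instead of NA
```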