Hello everybody and welcome to Working with Twitter Data. My name is Joseph Allen and I'm a research associate with the UK Data Service at the University of Manchester. First of all, in keeping with the self-promotional nature of social media data, if you're on Twitter, feel free to validate me by following me there or sending any questions about this talk that way. You can also email me at joseph.allen at manchester.ac.uk.

I sadly have to start this talk with a word of warning. We're about to grab some real tweets from the internet, and as such, sometimes we might pluck a random tweet that has the potential to be sexist, racist, homophobic, contain nudity, and all sorts of other stuff. All I can really do is apologize in advance. I'll try to keep that stuff off screen if I can, but we'll see.

I timed this yesterday and it took about 65 minutes, but I know we've got a 90-minute slot. So over the next 90 minutes or so, we'll cover generally what social media data is and why it's useful from a social science perspective. We'll go further to talk about why Twitter in particular is useful over other social media. I'll whiz through a collection of getting-started projects to try and inspire you. Then we'll take a quick break and move over to an interactive quiz to see what you all think of some ethical concerns I have about Twitter data. We'll then cover some basic analytics tools, and for those of you new to programming, this will be sort of an entry-level way of looking at Twitter data; that includes Twitter's built-in analytics and something called TwArχiv. Then we'll perform a demo using the automation tool Pipedream, allowing us to collect some tweets without worrying about the Twitter API or any programming. And then finally we'll dip properly into Python. No, we won't. We're going to be using a command line tool to play with the Twitter API and use Twitter's full archive search to explore some vegan data.

So to start with, let's talk about why any social media data might be worth collecting. Social media is one of the first times in history we have had such consistent access to an individual. When we try to represent an individual, this is usually done by listing some relevant facts about this person. The term I use for this is facticity. This covers who this person is now and who they were, but usually none of their potential. You might, for example, have a data set of names, ages, sexes, numbers of siblings, salaries, and other personal data. This set of data is very useful if you want a snapshot of a demographic at the time of collection. But six months later, our analysis can already be dated. Data collection on the number of staff working from home was very different in 2019 than it was in 2020, for example.

When we collect data on an individual, we also make a lot of assumptions. We assume that our data is recent enough to still be relevant in our study. We assume an individual didn't lie to us when we collected the data, perhaps covering up things society frowns on, such as drug use, weight, and similar things. We also assume that the individual isn't actually in denial themselves: they might underestimate how many times they've broken certain laws, or have an altered perception of themselves. And we assume that events outside of our data set can be ignored.
This could be a pandemic's effect on work culture, but it could also be hidden traumatic events or societal pressures that can change an individual in quite unpredictable ways.

So if we take these snapshots: say our individual currently has a salary of 25,000 a year. If we assume this individual is the same as everybody else, we could probably assume that their salary steadily increased to this level and will continue to do so. Or maybe it jumped to this level after a recent job change. Or maybe this individual was somebody who was dissatisfied with their high-paying career and decided to change fields altogether, trading their income for a better quality of life. The thing to notice here is that our data set has the ability to not actually represent the individual. Extrapolating and assuming is the best we can do with this data.

Another example of this could be the column in our data set labelled diet. Our individual happens to have the label vegan. Are they a strict vegan? Do they take a break on Christmas Day to avoid arguments with the family? Does that make them not a vegan? Why are they vegan? Do they value animal lives? Did their doctor tell them they'd better be vegan, otherwise bad things are going to happen, and this worked? Did they transition slowly to veganism, or was it a snap decision? Do they wear vegan shoes? And finally, will they still be vegan tomorrow? This label in a data set gives us no indication of the answer to any of these greater questions.

With a data set comprised of a list of facts, we have this snapshot view of the individual. As I said, we understand the facticity of this individual. But what we seek to know is what I refer to as the individual's transcendence: that is, their ability to transcend the immediate situation they're in. Anything could happen to this individual outside of our data set and data collection. Maybe they have a rough day at work. Maybe that restaurant that said they had a great vegan menu only had a bean burger. Maybe this is the last time our individual can handle eating a bean burger, and that is the final straw on what was a shaky ethical foundation or an order from a doctor.

So let's look at this question again: will they be vegan tomorrow? We can make predictions on this. Statistically, people who are vegan today are probably vegan tomorrow. What if we knew their age, salary, job title, and more? Maybe these things correlate with staying vegan for longer, or give us higher confidence that someone will retain veganism. With that, we can make a slightly better prediction. Or maybe that's all overkill. Maybe we just want an answer to that question. They would probably just tell us if we asked them. They do it for free and they do it all the time. And this is where social media grants us access to the data set we always thought we wanted.

Social media can be defined as technologies that allow the sharing of information via virtual communities. It's no secret that today many people share facts about themselves freely with other people, academia, businesses, and governments multiple times a day. It's easy to see why this might be useful to a researcher or to industry. So how would a human try to answer our question of whether an individual will be vegan tomorrow? Well, the gut response isn't machine learning. It's human learning, right? We use our skills of communication, linguistics, and intuition.
In the real world, we could simply approach somebody and ask them: how is veganism going for you? From their response, we can understand their words, their body language, the intonation, sarcasm, and even ask follow-up questions to dig deeper. Social media data brings us closer to this conversational approach, without all the complexities of managing that conversation. Which leads us to the next big question: why should we look at Twitter? Why don't we look at Facebook or any other social media?

So, enter the tweet. It's got a 280 character limit, which means that we get very succinct messages, and these tweets usually make sense without needing outside context, the replies, or the full conversation. Because they're so short and direct, they have quite a concentrated sentiment and a very clear topic focus. Again, a fantastic source for data analysis and text analysis. This content is easy to scrape with ordinary web scraping technology, and Twitter also offers its own modern API for more complicated searching, filtering, and sampling. Tweets are considered qualitative data, as they're natural language, but increasingly natural language is entering the domain of data analysis and machine learning as we further understand human languages. So with individuals willingly and publicly stating their views, alongside geolocations and timestamps, you can understand that this data set is highly desirable. And Twitter works really hard to facilitate the sharing. Remember, Twitter is a marketing platform, and the users are the product. They want you to find it valuable and easy to use, and to encourage you to share how you feel. This is great news from a social science perspective, but maybe not good news from a societal perspective.

But what are some of the disadvantages of Twitter as a data source? One is that the open source packages that make collection of data so simple are also competing with one of the world's largest tech corporations and the speed at which its technology changes. These packages are frequently out of date with the newest features. It's also said that Twitter is marketers talking to marketers about marketing, and a lot of content is very self-promoting or even commercial. You need to take this into account when using Twitter data as the foundation of any project. And the user base of Twitter introduces its own bias. Its users represent only some of the people who have access to the Internet. Most Twitter users are aged 24 to 35, 70% of them are male, and 80% of all tweets come from 10% of the users. So only you can answer how this bias will affect your work.

That being said, Twitter also has some great benefits. Each social media platform gets to set the barrier to entry for scraping its content, and some make this really difficult for bots, but many now offer an API to anybody with a valid academic reason to access their data in bulk. In this regard, Twitter, I would say, is probably the most accessible social media. There are a handful of packages which make it easy for developers to collect the data, and some of these require very little programming to make use of. Twitter also, as I say, has an academic tier to its API allowing the collection of 10 million tweets a month, and generally Twitter attracts quite a small amount of controversy compared to other social media platforms that have swung elections or allegedly installed spyware on phones.
And then finally, that character limit of 280 characters reduces rambling and means that whatever we collect from Twitter is generally quite focused content.

So I find that whenever I attend a talk like this, the most valuable thing I can gain is some quick entry-level project ideas or inspiration. To kick that off, we'll briefly look at some sentiment analysis. As you may have noticed, I'm quite interested in the perception of veganism on Twitter, and it's quite easy for me to find users who seem intent on sharing their opinions of veganism as well. So when I saw that Vegan Boy tweeted, "I'm getting pretty tired of this whole hashtag vegan thing", I knew something was wrong. I was picking up that Vegan Boy thinks some aspect of his lifestyle is exhausting. I can search this user's history specifically to see what more I can find out. I was excited here. Had I detected a vegan about to publicly give up on veganism? But no, I hadn't. A few months later, things seemed to be back to normal. Vegan Boy tweets, "It has never been easier to be vegan and it's finally getting cheap." Again, using my human interpretation of the language, I can infer here that Vegan Boy is feeling quite positive about veganism. Vegan Boy might be smart enough to keep some of these opinions to himself, but there are always millions of other users I could look at instead. And what if I didn't need to use any human skills to understand how Vegan Boy was feeling? What if I could use a few lines of code to replace me?

It's usually easy for us as humans to look at a sentence and determine whether it's positive or negative, and which of the topics have that positivity or negativity associated with them. For example here, it's clear to us that Vegan Boy is feeling positive about food and negative about the government. There are a few ways sentiment analysis can work, but let's go into detail on an example. To start with, we need a training set of sentences with labels marking them as positive or negative. We have some sentences where that sentiment is obvious: "I love cats" is clearly positive, "I hate the weather here" is clearly negative. But we also have some more nuanced sentences. "I love the government taxing my hard-earned money": to us it's clearly sarcasm, but it uses strongly positive language like the word love. Again, "I hate you, you're so silly" could perhaps be a playful sentence said to a friend, but from a text perspective it looks clearly negative, even though the intended sentiment could be positive. As long as these cases are rarer than our usual case, our model should adapt to dealing with them.

Our model might begin to recognize that certain words appear with high frequency under particular sentiment labels as well. We get positive words like love, cats, amazing, great, and we get negative words like hate, weather, struggle, sleep, and stub. We also have what are commonly called stop words. These are words generally believed to have no impact on topic or sentiment in a sentence: words like I, you, to, it, the, and others. So looking back at our recent tweet from Vegan Boy, we can see that he loves food and hates the government. Our model will score each individual word, ignoring the stop words we talked about, and words like I, food, and the carry neutral sentiment in our model. The word love, from our training data, has a large positive sentiment and may give, say, a plus five sentiment score.
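(As a quick aside for the programmers: this word-by-word scoring is simple enough to sketch in a few lines of Python. The lexicon and weights below are invented for illustration; real tools learn them from labelled training data.)

```python
# A minimal bag-of-words sentiment scorer. The lexicon weights here are
# made up for illustration; real models derive them from labelled data.
LEXICON = {"love": 5, "amazing": 4, "great": 3, "hate": -5, "struggle": -3}
STOP_WORDS = {"i", "you", "to", "it", "the", "and", "a"}

def sentiment_score(text: str) -> int:
    """Sum per-word scores, skipping stop words; unknown words score 0."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(LEXICON.get(w, 0) for w in words if w not in STOP_WORDS)

print(sentiment_score("I love food and I hate the government"))  # 5 - 5 = 0
```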
The word hate has a large negative sentiment, which may give a minus five sentiment score, bringing this whole tweet back to a neutral zero. Another quirk is that if complaints are frequently associated with a topic (for example, if people complain more about the government than they defend it in our training data), that topic itself can pick up negative sentiment, and this is really important for Twitter data. So although government should be a neutral term linguistically, in our training data it wasn't, and this means that in our prediction, government will receive a negative score, thus tipping the scales and classifying this whole tweet as negative. So while the world of tweets, and free text generally, was once difficult to analyze, we now have advanced tooling to help process it. Personally, I think sentiment analysis is the prime example of something that sounds complicated, sounds impossible, but is actually fairly trivial and quite cool to implement. As such, we can begin a number of cool projects this way.

Introducing cool project number one: I recommend trying to collect all of an individual's tweets and plotting the sentiment of each tweet over time. Ask yourself whether these tweets correlate with any life or political events. If you pick yourself or a friend as that individual, you'll have a better understanding of their life, hopefully. I won't be demoing this one, but there is a demo in the GitHub I've linked, and that's all linked at the end of these slides as well.

Cool project two: analyze your friendly group chats. Now that all free text is potentially quantitative, see if you can export your Facebook Messenger chats or your WhatsApp text chats. You can export your WhatsApp chat, read it into a data frame, and apply this sentiment analysis to calculate which of your friends send the most positive or negative sentiment messages. Please tweet me and tell me how this goes: who's the most positive, and which of your friends don't like that you've done this. And here we can see that of my friends, we all share a similar number of negative sentiment messages, but generally I send the largest percentage of positive messages, therefore bringing up the mood of our WhatsApp chat and definitively showing that I am the best friend.

Another example was a funded digital art piece where viewers could tweet mean things to a rose, releasing poisoned water onto the rose. Alternatively, if people sent nice messages to the rose, it would water the rose. And if nobody tweeted the rose at all, it would dry up and die. Hopefully the metaphor for social media is quite obvious there. And to test a system like this, we targeted infamous American politicians on Twitter. It's a really reliable way to make sure someone's sending a strongly charged tweet every couple of seconds.

And if sentiment analysis doesn't interest you, my second favorite thing to do is to make a word cloud around a topic. Again, here's one I made using a WhatsApp chat with my friends. With some tweaks to the color scheme and shape, you can make some really interesting visuals. For our vegan data, this could be a tree or a plant and use a lot more green colors. And all you really need here is just a big list of all of the text you're using, and websites will process that for you, or you can upload a value count of those words. You can take this further: you can remove those stop words that are quite common, you can lowercase all the words so different capitalizations are counted together, and you can even stem words, so that swim, swims, and swimming all reduce to the same word stem, swim.
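(If you want to try that preprocessing yourself, here's a minimal sketch using the wordcloud package and NLTK's Porter stemmer, both pip-installable. The input file name is hypothetical; point it at your own exported chat or tweet text.)

```python
# Sketch: lowercase, strip punctuation, drop stop words, stem, then draw
# a word cloud. Assumes `pip install wordcloud nltk` has been run, and
# whatsapp_export.txt is a hypothetical name for your own exported chat.
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud, STOPWORDS

text = open("whatsapp_export.txt", encoding="utf-8").read()

stemmer = PorterStemmer()
words = [w.strip(".,!?").lower() for w in text.split()]
words = [stemmer.stem(w) for w in words if w and w not in STOPWORDS]

cloud = WordCloud(width=800, height=400).generate(" ".join(words))
cloud.to_file("wordcloud.png")
```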
I have a blog post detailing all of this; that's also in the references at the end of this talk.

And the final project idea is to investigate topic analysis. With this, we can extract topics from tweets. We could classify which of our vegan tweets actually related to food prices, ethics, politics, or diet. We can also group similar words or topics, such as vegan, veganism, and veggie, which can be quite useful for our searching later on. And we could also seek out terms that are perhaps the opposite of veganism, maybe barbecue or steak or carnivore, and see how those people react to similar vegan events.

So next, we need to ask ourselves: is it ethical to use social media data? We could be so preoccupied with whether or not we can access Twitter data that we never stop to think if we should. In this section, I'll introduce some scenarios and ask you, as the attendees, what you think. So I'm going to jump over to Mentimeter. I might use the outcomes of this presentation in an upcoming blog post as well, so if anyone doesn't want their results used, let me know.

Scenario one: the University of Manchester is doing a study on how hate speech relates to veganism and plant-based diets. Your tweets about food and veganism have been scraped. In scenario one, is it okay for researchers to use that public Twitter data? Ah, lots of yeses. One no. A couple of "I don't know"s; brave, "I don't know". So it looks like overwhelmingly people think this is okay to use. I guess maybe we're showing the signs that we are researchers here in saying that we don't know; we haven't researched this yet. But it's looking overwhelmingly positive, very few people saying no, which is quite common. We should say that academia is generally for social good, and this should be a social good project, but you might not be happy with your tweets being scraped. That's totally reasonable.

Question two: given the option, would you consent to your Twitter data being used in that scenario? So if the university reached out to you and said, can we use your data in this study about veganism and hate speech, would you say yes or no? Wow, lots of yeses. Some people said no to it being used before, but given the option to consent, it's very different. Wow, really interesting. Okay, overwhelmingly yes then.

Question three: how would you feel if that academic body reached out to you, sent you a message or an email to thank you, inform you, or compensate you, having already used that data, with or without your consent, basically? And you can type: this one is free text, so if that makes you feel happy, you can type "happy". I'll give a minute for this one because these tend to come in quite slowly. Okay: "validated", very interesting; "bemused"; "confused"; "it depends if I've been saying anything controversial". I love that point. Yeah, it's fine in a study about veganism and hate speech if we're not hateful to vegans, but what if we were hateful to vegans? That's potentially a big problem. "Confused", "surprised", "a bit weird". Again, I think this probably happens a lot more than people realize, and I wonder just how many times these tweets are actually being scraped. "Depends what the compensation was". Again, it's very rare that we're compensated for anything considered public data. "Intellectual theft". I love it. "Irritated". "I'd be more annoyed by them reaching out". "Not happy without consent". But again, it happens all the time.
"Great, benefiting from something I didn't know about". "Would depend if I felt judged". Again, "intellectual theft". "Surprised", "surprised", "happy", "confused". Wow, so a big variety there; really cool to see. "Would feel intrusive". "I wouldn't mind them gathering my data anonymously, but would not want them contacting me personally". Interesting. "Don't want to know". Yeah, so there's a big question here of: is Twitter data personal? We can scrape your content anonymously, but did you tweet anonymous content? Have you ever tweeted something like, "I live in Manchester and my name is Joe Allen"? Because that might still be in data that's considered anonymous. So, pretty strange stuff there. Okay, I'm going to move on. Thank you very much for that. That was pretty cool.

All right, scenario two: a large tech company has scraped Twitter data and the locations you visit, how long you spend there, and how often you visit. They're using this data to profile you and target you with effective location-based adverts. Is it okay for industry to use Twitter data in scenario two? Okay, unsurprisingly, no. It's the exact same data source, right? Academics, and anyone actually, any hobbyist, can access your Twitter data, geolocations, all that kind of stuff. Anyone can do this. But when it's used to make money, the answer is no, we shouldn't be able to use this. And when it's used for academic research, generally it is okay. But really the situation is exactly the same for both of these. It's just that the social good is not there.

So, same question again: given the option, would you consent to that data, your data, being used for business gain? Again, overwhelmingly no. People don't like this. People don't like being targeted directly; people don't like their data being used for business gain without their knowledge. But yeah, that's what happens.

Question six: again, how would you feel if a business reached out to you to thank, inform, or compensate you for your public data in scenario two being used for location-based adverts? "Angry", "irritated". Wow. Okay, so people don't like this. It's the same scenario, but people don't like this, obviously. "Very upsetting". I mean, there's no study here, so this isn't academic. This is something someone can already do without asking you. "Annoyed". "Surprised that they were that honest". Yeah, totally. I mean, it would be great if the internet compensated users every time their data was used anonymously, but it doesn't. "I would be happy if they gave me money". Yeah, I would like that. There is a web browser, Brave, that does something like that, where adverts give you a fraction of the money spent on them. "More uncomfortable". "Wouldn't mind". "Annoyed". So what we've got there is a lot more negativity, understandably so. A lot of "happy if they were compensated", which I think is fair, because at the moment we aren't compensated. But this is really the situation we're in. But yeah, cool. Again, very good to see.

Right, next one. Question seven: when Twitter data is used in academic research, should we inform each user that their data has been collected? So, lots of nos. As we've seen in those previous responses, people get weirded out when they realize they've been observed. But if people don't know they've been observed, they're not upset yet, or they think, why would someone look at my tweets? So it's quite a strange one.
And even then, just the process now, with GDPR, of reaching out to people you're not supposed to be contacting can itself be considered a breach of GDPR. The only conflict there is that it's public data. Anyone can look at it, anyone can grab those usernames, anyone could tweet anybody else and tell them that they've taken their tweets for a presentation they're giving and they're showing them to a group of 40 people without that person's knowledge. And depending on the tweet, you know, sometimes I've seen some pretty racist and sexist things in the process of making this presentation, and I don't know if the authors would like me showing them, but it's out there. So what are we supposed to do?

Question eight: should research with Twitter data require an ethical review from some sort of panel? Again, this is public data, right? So we have to decide whether we are allowed to use this. But at the moment, anyone can use it; this data is available to anyone. Some universities do have an ethical review specifically for social media data. Wow. Yeah, overwhelming yeses. Not surprised, really.

Question nine: all universities require that personal data and related studies go under ethical review. Is the text from a tweet personal data? The questions are getting slightly more controversial as we go. Cool. Okay, so mostly nos, lots of "I don't know"s. So we've got good wiggle room there. I would argue that it is personal data, because from almost any tweet text, you can identify the user who made it. It's all very well to say, okay, we'll hide usernames, we'll hide geolocation. But, you know, again, if I'm out there tweeting "I am Joe Allen, I live in Manchester" and I put that into Google, it will show my tweet. Twitter is very good at search engine optimizing all tweets. So it's quite weird if we talk about the risk of re-identification, and then where do we go from there? So it's quite a strange one. I think "I don't know" is a good answer. But so is "yes".

Question 10. Okay, so with this one, you're going to get a scale of zero to 10, and it'll plot some histograms and things. You're not distributing points between all of them; you're just saying how responsible each of these groups is for this problem. So: how responsible are each of the following groups in making sure Twitter users and their data are safe? If you think parents are largely responsible, you would put, like, eight, nine, ten; if you think schools are largely responsible, eight, nine, ten; and so on. I love this question. We get some good visualizations here. Oh, someone says everyone's responsible fully. Or someone doesn't want to fill in the form. Yep, so this is stacking up just about as it usually does. Usually Twitter is the one that stands out, right? We say Twitter should be responsible for informing users that their data has been used in academic research, Twitter should be responsible for educating users on how to correctly use Twitter, all that kind of stuff. But then at the same time, the user is responsible, right? The common argument here is that things on the internet are out there. They're public. They're there forever. You need to be careful what you say out there. But at the same time, we have children using these platforms sometimes who don't understand that. And we can't really anticipate how poorly this data will be used in the future, right? Nobody thought that Twitter data was going to be used to try and swing an election at any point.
But these things have happened. Cool. Oh, yeah: "businesses using data for profit". Sorry, I was hovering over that; that's why it looked weird. Otherwise, it seems schools and parents have almost no responsibility in raising a new generation of people who know how to use social media, especially in comparison to the other groups. Government intervention with these things, no idea how that would work, but yeah, I agree. I think Twitter is most primed to solve this problem. They distribute the API keys, so they should be able to give users the ability to be forgotten from data that is scraped using those API keys. But again, the academic tier of this research API is very, very new. It only came around in the last year. So I imagine this is where things will go in the future, but we're not there quite yet.

Question 11: should individuals have a right to be forgotten from research, that is, the ability to withdraw their public information? Okay, well, unsurprisingly, people agree with this statement. The right to be forgotten is huge. In a lot of academic research, this is already an assumed thing, right? We don't collect data without informed consent, and at any point, users can remove their data from the research. Unless that data comes from social media, for some reason. So that's kind of the point I want to make here. We mostly agree; we're mostly on the same page here. I probably shouldn't prime the question with that information as we're talking, but it's nice to see people agreeing.

And finally: do we think Twitter data is actually even a useful source for research? Fantastic. Yes, we do. I will say, as we've seen, Twitter data is super biased, but most data sets are quite biased. I think just being aware of that bias will help. But we have to make sure we're not making huge policy changes based on, like, 20% of the population. Okay, that's the Mentimeter done. Thank you very much. I can share the results of that, so I'll add them to that tweet I sent out earlier that has all the results, and anyone can see them. But I might use this for a talk later on as well.

So, some arguments for being able to use Twitter data are as follows. This data is public, so we shouldn't need to seek any permission to access it. Any user posting content they are ashamed of should have known better; this is the internet. This data set is too valuable to ignore: it contains a wealth of opinion data on politics, company values, and societal trends. And there is no privacy concern, so we can ignore GDPR and any data protection legislation that does come about.

On the other side of this, I argue that we shouldn't be allowed to use Twitter data freely, and that it should require some sort of intervention or ethical review. Just because it's public data doesn't mean all users treat Twitter as a public data source. While their intentions may not have been clear, that doesn't mean consent is the default option. We can ask some further questions. Do children know how their data will be used? Did all users anticipate the uptake of machine learning, targeted advertisements, and political manipulation that has only been possible thanks to social media? Are we bold enough to claim that we are through the worst of this misuse of public data sets? We've still got years and years of this to come.

There's also the question of informing users that their data is used. Many users feel that their content is unobserved, and it is this data that we seem to be most okay with using.
If we reached out to these users, I feel like many of them would feel disgusted, as we've just seen in that Mentimeter. People don't really like this. They don't like knowing that their content is used. But I can tell you it is being used right now. So it's a strange one.

And beyond talking about ethics, I'd argue that a tweet's contents are sometimes unique, and as such could uniquely identify the individual who created them. The justification that this isn't a problem is, again, that the data is public. But I think public and observed are very, very different things. In a more traditional study, we would give users the right to revoke their data from the study, but we don't offer this option with social media data. How can we give users the right to be forgotten and to withdraw from these studies, and, further, whose responsibility is this? As we just saw in Mentimeter, most of you think it's Twitter's; many of you think it's the government's. At this point, Twitter data is already being used in over 500 studies at the University of Manchester. So we have to ask: have we already set a precedent here, or is this something we can try to fix? On the other hand, the University of Warwick has already decided that projects which make use of social media data require ethical review in some cases.

Data being publicly available is arguably not justification for its unaccounted use in research and business. Users have no idea how often their content is scraped, or to what ends it's used. Users have little anonymity in this process: even if their personal name and location aren't scraped, a large portion of Twitter users can be re-identified simply by Googling the tweet text. Some very clever disclosure control would need to be used to get around this without fully polluting a dataset. As users weren't contacted, they also have no ability to withdraw from studies. And if they could withdraw from studies, how would we manage this when we've scraped millions of users? Who would even be in a position to solve this? Again, I think it's Twitter. And so do you, it sounds like.

Okay, so we've seen what we can use Twitter data for, and we've asked ourselves if it's okay to use. So let's start doing some analysis. To start with, I'm going to assume nobody here wants to code right now; exploring the data can still inspire project ideas and give us an idea of what's possible. So to begin, we've got Twitter's built-in analytics. Twitter is a modern social media platform, and it provides an analytics service, basically for its business users. You might have to opt in to turn this on for the first time, and there isn't really a trivial way to export this data beyond some web scraping technology, or just copying and pasting the numbers. But I will show this now.

Okay, so here's my Twitter. I think you should be able to see that. Yes, you can. Okay, so in order to access the analytics, obviously you need a Twitter account, and you'll only be able to access your own analytics. You click this little settings block here and go into the analytics dashboard. You'll get access to your recent analytics, so you've got a 28-day summary along the top. This shows a little graph that, again, you have no way of exporting, and you can't really read any of the values on the graph, so it's not too useful. We can see that I've tweeted 177 times in the last month, which sounds crazy; it's up 400%. You can also see your tweet impressions.
So this is the number of times that your tweet has shown up in somebody's timeline. It doesn't mean they read it. It doesn't mean they liked it, retweeted it, or engaged with it in any way. It's just impressions, which is why that number is so huge; the actual engagements will be hugely lower. Profile visits is the number of people coming directly onto my profile page, and again, we can see that has gone up. Mentions is the number of people mentioning me in their tweets, and then there's the number of followers I have and have gained in the last 28 days.

We also get a monthly breakdown, so we can see our top performing tweet each month. You can see mine was about some Python plot stuff for this talk. You can see the top mention, the best tweet I was mentioned in by somebody else. You can see a summary of how many tweets, impressions, profile visits, mentions, and followers we had for this month on the side; who our most important follower is, which is some MP or something like that; and the top media tweet, so what was my best picture, which was about a bus stop that wasn't built properly, basically. We get this every month, so we can whiz through and see lots of cool data for each month, and that can potentially help formulate a research question, but mostly it's about getting your head into this world of Twitter data. So that's one thing. Again, you can only look at your own analytics here, so it's not too useful: hard to scrape, hard to get data out of.

In order to do anything else, you probably want to get your whole Twitter archive. To do this, you go to More, then Settings and Privacy, then Your Account, and Download an archive of your data. I'm not going to do that now, but if you do, it will take multiple days to prepare itself and be sent to you. That's what you need to do if you want to explore the full wealth of your tweets. I think that's everything for now.

Cool, and then the next one. I'm not sure how this is supposed to be pronounced, but I've been calling it Twarchive; it's written TwArχiv, and I don't know what that character is. If you thought that Twitter analytics stuff was really lame, this is a slightly more technical step, using some open tools from a website called Open Humans. Bastian Greshake Tzovaras is the director of Open Humans, and its project TwArχiv shows some really neat visualizations from personal Twitter data. They also offer $5,000 grants for interesting projects, so it's worth checking out if you're really enjoying this Twitter data stuff. In order to use it, we need to export that full Twitter archive I just showed you; you basically upload this to the site and it generates some visualizations for you, I think using Python.

So the first one we get is: where have I been tweeting from? It appears nowhere. This is my graph. I've been in the UK for at least the last two years, but there are no tweets there. So while tweets have a huge amount of metadata associated with them, including geolocation, users can opt out of this, and I have opted out because I think that's super scary. But a huge benefit of Open Humans is that other users opt in to make their data open. So while I can't share my own location data, I can anonymously look at other people who have actually given consent for this to happen to their data. These visualizations can tell a really nice story.
We've got a user here who probably lives in either New York or California, or Tanzania, or maybe goes on holidays between them and sends lots of pictures, something like that. We can also see all the places they've been in the period this data was collected, but without some sort of time visualization, we can't really do much more than this. And a limitation of these visualizations is that they're generated and given to you; you don't get to play with them at all. In order to do that, you'd have to download the dataset yourself and do these plots yourself.

Next we have tweets per day. This is a rolling 180-day mean, because the number of tweets per day is usually quite low; it can be really erratic and difficult to read, so they add a lot of smoothing. Again, if you think this is too much smoothing, you have to go and investigate the data yourself. All we're doing here is looking for some project inspiration, to see what kind of data we can get. I started my Twitter account while I was running some tech and gaming events and used it purely for promotion, which is why there are so many tweets at the start. Around 2018, I started a tech meetup, and the count gradually increases as I ran more events. You can also see that around 2021 is when I started with the UK Data Service and started tweeting about the events I've been running with them.

We can send replies, which are tweets in response to somebody else's tweet; we can do a retweet, which is where we reshare somebody else's tweet; and we can do a regular tweet, which is our own original content. This graph shows the ratios of these over time. Again, there's lots going on here, but without knowing what I was doing at the time, it's really hard to analyze this data or make any use of it. There's not a lot of context here.

There's also a really cool graph of tweets by hour. I didn't have much data to generate this, so this is somebody else's, but there's a really cool split here between the weekday and weekend trends. Again, it would be really interesting to see how this matches up with that geolocation data from before. Do people act differently when they're on holiday versus when they're at work, and things like that? We can see here that people don't do much in the early hours, but on average they wake up just before nine and start to use Twitter a lot. It seems like between nine and ten they peak in Twitter usage, so maybe this is them getting to work: they make a tea or a coffee, and then they go check Twitter at work and get paid for it. It dips, then it rises again after lunch when they're back in the office, understandably, and then dips off for the commute home. It looks like people want to switch off, and then maybe it peaks again after dinner or with dinner, drops off, and then peaks again when they're going to sleep. Very typical nine-to-five-style users, I would say. And then the weekend just doesn't seem to make any sense at all, like lunchtime tweeting, but again, I don't know why it cuts off, or whether there are just no tweets in this period. So it could be a nice project idea to figure out why people are tweeting at these times, or to try to forecast upcoming tweet usage. Lots to do there.
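(For the curious, that tweets-per-day plot is easy to recreate once you have your own archive. Here's a minimal pandas sketch, assuming you've already flattened your tweets into a CSV with a created_at column; the file name is made up.)

```python
# Count tweets per day and smooth with a 180-day rolling mean, as in the
# TwArχiv plot. Assumes my_tweets.csv (hypothetical) has a created_at column.
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("my_tweets.csv", parse_dates=["created_at"])
per_day = tweets.set_index("created_at").resample("D").size()
smoothed = per_day.rolling(window=180, min_periods=1).mean()

smoothed.plot(title="Tweets per day (180-day rolling mean)")
plt.show()
```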
And this website, Open Humans, and TwArχiv specifically, got quite a bit of media coverage, because it was a way to surface whether your tweets have a sex-based bias, and a lot of reporters and such were being encouraged to use the site to showcase that they're not sexist. But again, as I said, most Twitter users are men, so you should expect to tweet at and see stuff from men more often. This is my data for replies; but then, I live in Manchester, and the data community for women here is actually quite huge: we have HER+Data, R-Ladies, PyLadies, and all sorts. So generally mine seems to be about equal. And then for retweets, I didn't have enough data to generate this one for some reason, but this is perhaps a better indication of whose content we're echoing. And as I said, the majority of users are men, so it makes sense that there are a couple more retweets of men. That being said, though, there's no way to collect gender information from Twitter. Twitter doesn't ask for your gender when you sign up, and it doesn't showcase your gender at any point. So in order to get these genders, a Python package called gender-guesser was used (that's what Open Humans is using), which has a dataset of male and female names. It's pulling your Twitter name and then making a prediction based off that name, which is obviously, again, a huge bias in any academic project, something that you have to state outright and try to find a better solution for. But it's an interesting way of getting these visualizations.

Okay, so that's everything we can get just from having our own Twitter accounts, but we probably want to analyze somebody else's tweets at this point. Things are going to get a little bit more technical here, but we're going to do almost zero programming. We're going to write, like, four lines of code just to calculate the sentiment, but otherwise there's no programming at all, and we will be playing directly with the Twitter API, but only through a personal account.

So, Pipedream is a tool for automating workflows. You can integrate with hundreds of common apps, such as Slack, Twitter, and Google Drive. What we do with Pipedream is listen for triggers, such as a new message being sent in a channel on Teams or an event being added to my Google calendar, and after these triggers, we can perform a series of actions, such as sending an email or storing a form response in a spreadsheet. Instead of worrying about developer accounts and managing API keys, we can bypass all this using our personal Twitter account. Pipedream is an ideal solution for business or personal projects where you want to collect targeted tweets from this point onwards. The solution won't let you collect tweets any more than seven days old, so it's ideal for recording live tweets, but if you need, like, Twitter data from 2019, this isn't going to help you.

So I'm going to hop over to Pipedream now and do a quick demo. I'm just going to check whether anyone's having any problems. Still seeing Mentimeter? Okay, all good. Give me one sec. Right, well, I didn't need to pause the sharing. Okay, so this is Pipedream, pipedream.com. I don't think you'll be able to follow along with this right now, but it is a great tool for this. Once you've got an account and logged in, you can click New. And what we want to do first is decide how we're going to trigger this automation. So obviously we're going to want something from Twitter.
And in here, we can get all sorts of stuff. We can get liked tweets. We can get my tweets when I've posted. We can do something every time there's a new follower. But what we actually want is this one, Search Mentions, which will basically use the search functionality from Twitter. It'll ask you to connect an account; so again, you do need a personal Twitter account in order to access a lot of this stuff. I've already connected my account.

Here we can enter a search term, so I'm just going to search vegan. There's a whole list of standard operators that we can use here if we want to ignore retweets, get things from certain accounts, all sorts of stuff. But for now, I'm just going to go for veganism. And they actually give us a lot of dropdowns to add in those things. We can get recent tweets or we can get popular tweets; I'm just going to keep it as recent. We can choose whether we include or exclude retweets and replies. We can even set the language, so I can make sure I'm getting English tweets, or tweets written in English, sorry. I can set a locale; I'm not really sure how that one works, it might require a bit more reading. We can also set a latitude, longitude, and radius, so we could get tweets within a five-mile radius of London, for example. Enriching tweets will get the full tweet, so it's worth keeping that on. And then the maximum number of tweets to return is 100 at a time, but we can run this every 15 minutes, so we can slowly build up a big dataset, though it's probably not going to be the full wealth of tweets from today. That's all good, and I'm just going to title this "get tweets".

So we click Create Source. That creates our trigger, and it pulls an event for us to use as a test. We can see here we've got one tweet. This might be... okay, full text, here we go: "Hey, at Kellogg Company, make Mini-Wheats vegan, thanks." Pretty easy. I don't think there's going to be much sentiment there, so it might be quite a boring example. If we grab this tweet's ID, we can actually go and check the tweet out on Twitter. If we find any tweet on Twitter and click it, you can see (it might be too small on the screen share) that this number at the end of the URL is the tweet ID. So if we replace that, Twitter will actually reroute us to the original tweet. Let me check that again; those IDs are different, I must have grabbed the wrong one. Okay, here we go. We've got Shane Turner. I've now grabbed his tweet against his will, but that's not a problem because it's a public tweet, right? "Hey, at Kellogg Company, make Mini-Wheats vegan, thanks." So that's the tweet. We can see it came from today, half an hour ago it looks like, and we can even check out where this guy's from: St. Catherine's, I don't know where that is.

So we didn't do any geolocation stuff, but we've actually got all of that data in Pipedream now. If we have a look, there's a lot of data we can see in here. We can see geolocation data, anyone mentioned, any hashtags used, when the tweet was created. We can see all this user information: who the user was, how many followers they have, what their user ID is, what their location is, what their name is. We've got all this stuff in here, and there's a bit more data you can get beyond that, but we've got a lot to use, basically. And if we click Send Test Event, all that will do is grab that tweet again, because it's just a test event.
And we can see that shows up on the side. So that's not great; we've just grabbed the tweet and done absolutely nothing with it. We need to add a next step. What I'm going to do next is calculate the sentiment of that text. This is the only little bit of code we're going to write. We're going to write some JavaScript here, using a package called sentiment, which has a little demo of how to use it really quickly. So all I'm going to do is paste that demo code in there: var sentiment equals require('sentiment'), which imports an npm package, so we can use any JavaScript package there. It creates a new sentiment object, then uses that object to analyze the sentence "Cats are stupid", and logs out the result. So we can deploy this and run it, and it should log out "Cats are stupid" and the score if I send that test event again. You can see it's getting that tweet, running this, and we're getting all these details here: "Cats are stupid" are the words, the only one with sentiment is the word stupid, and the total score is minus 0.6, on a scale of minus one being extremely negative to plus one being extremely positive.

So we can tweak that. Obviously, we don't want to check whether cats are stupid; we want to use the actual tweet data from here. Again, this is something Pipedream makes really easy for us. This function receives an event and all of our step details, so if I type "event dot", we can access anything from that object before. In this case, we just want the full text. We also want to return that result; that'll be fine. So now if I deploy it, we should get zero returned, I'm thinking, because it doesn't look like anything in there is positive or negative. Here we go: yep, nothing. We've got all those tokens, "Hey, at Kellogg, ellipsis, make Mini-Wheats vegan, thanks", but none of those words have any positivity or negativity associated with them, and as such, our sentiment score here is zero.

Okay, so we've got a tweet, and we've done the sentiment stuff. Next, I'm going to write them into this Google spreadsheet that I have. So I'm going to add a single row. We need to select our Google account (again, I've already done this), select our drive (already done), and select the spreadsheet. All of my spreadsheets are available here, but I'm just going to write to my demo sheet, which I've just shown, and the sheet name, which is going to be "demo sheet". That just chooses which of the sheets this is going to go into. You can see I've done this a few times before. And what we can do now is just write values into those cells. I think what I'm going to do is get the username, the tweet text, and the sentiment of that tweet. Anything else? We'll get the time the tweet was created as well. So: username, created at, tweet text, and sentiment.

Okay, so we need the username. We can grab objects from those previous events through this browser. So the first one I need is username, so I'm going to scroll down to that user object. I can actually search it, I think. There we go: user. And I'm going to get their Twitter handle... how am I going to get that? I'm not sure I can see it. I'll just get their name, then. Then I'll click a plus for the next cell. That's going to be the time this tweet was created at, which is this one here, created_at. Plus again, then we're going to get the tweet text, which I think is called full_text.
And then we're going to get the sentiment. This is the only one that's a bit different: it comes from our steps. So this accesses data from all the previous steps, in this case the Node.js step and its return value, which is the entire sentiment object, but I'm just going to take the score. Then if I deploy that and run it again, running the same test event, it'll get that data and should write it to the demo sheet shortly. There you go. We've got Shane Turner, created at whatever date, we've got the text, and we've got the sentiment.

So that's one tweet, but we want more than one. In order to do this, all we have to do is turn it on. We've already told it to get 100 tweets every 15 minutes, so it'll just do that forever. It's worth going into the settings here if you do this and ticking this "limit concurrency" option. Otherwise it'll overwrite that row every time: you'll still collect tweets, but until it registers that the cell has been written to, it'll potentially overwrite it. So it's worth turning that on. And then all we need to do is enable it. If we go back into this trigger, we just have to enable it, and now every 15 minutes it'll do that again. There is somewhere to go to trigger this manually... let me have a look. Here we go. So I can run this now, and we can see it's already had a quick run and collected those 100 tweets just from doing that test event. But if I run it now, it should only get ones tweeted in the last eight minutes. There you go; you can see it's grabbed a bunch of new tweets, and it should be writing them here. They might take a little bit of time... there you go, here they come. It's a good demo, that.

Okay, so we've got a very negative one here: "I'm no vegan and never will be, but there is truth in this. The cruelty and massive killing is unnecessary. It's all to push market hunger and make very, very evil people very wealthy." So, negative sentiment. We've also got a positive one here: "Just completed a week of being vegan. I have to say the way I felt this week was amazing. I see why people rave about it. But to be honest, it ain't for me, at least not right now, lol. But I did pick up some good habits I'm going to incorporate going forward." Very positive. So there we go. That is Pipedream, and it will run every 15 minutes. It's a good way to collect tweets from now onwards. Let's hop back to this.

Okay, so that's the Pipedream demo done. From what we've seen in that demo, we can use this tool to access Twitter, Google Drive, and hundreds of other APIs without needing to understand what an API is, and as such, we can get away with not needing to know any programming languages. That being said, a little programming turned that process into an automated sentiment analysis tool as well, so it is helpful. But the downside is that we can't scrape historic tweets.

Next, we're looking at the Twitter API. You've likely heard this word API before. It stands for application programming interface, but that isn't particularly helpful. An API basically lets us access data from a source we might not otherwise have access to. Usually we use these APIs through helpful user interfaces, such as websites, mobile apps, kiosks, things like that. But when we need to engage with an API directly, there isn't an app premade to help us; we normally have to write that app ourselves, or use something like Pipedream. We could consider an API the waiter in a restaurant. They give us access to what we want, as long as we play by their rules.
Instead of engaging with our waiter through a website or mobile phone, we order items off a menu. And we know that we can't break these rules: we must order food that is on the menu. We can communicate any allergies we have (our security concerns) as well. And instead of us working with the database directly, our waiter communicates with the kitchen to deliver the content we're requesting. APIs control how we access most of the web. They handle logins, sign-ups, distribution of private messages, all sorts of stuff. Twitter shouldn't be giving out access to everybody's data, and the API is how they regulate this. You wouldn't want one user to be able to read another user's private messages, for example. Every social media platform gets to choose how secure it is, and as such, sometimes we have to handle various authentication protocols just to get access to the data we need.

So how have we ended up talking about the way the entire web works, when all we wanted to do was collect a couple of thousand tweets for our research? I'm frequently overwhelmed by the amount of knowledge that's needed to reach this point. The implementation of web security is understandably complicated, but without this web security, we wouldn't have access to this data at all. Without the ability to lock a door, we wouldn't even construct buildings. And without the waiter, we can't open the restaurant to guests in the morning, and everyone has to cook at home.

So, back to Twitter. The Twitter API is made up of various endpoints which expose data and methods. The version of Twitter we use day to day is likely itself built on a tiered version of this API. With the API, we can look up individual tweets or users, we can search recent tweets, and we can stream live tweets in real time and filter them based on conditions. Using the free tier, you can only access the last seven days of Twitter data; the premium and enterprise tiers will allow you to access the last 30 days. Further to this, the Twitter website hosts all tweets, so with some web scraping we could realistically collect all still-public tweets, but it's quite cumbersome. And as of August 2020, there's a newly introduced academic tier. Academia generally makes better social-good use of data than industry does, and until now our options have been to either use that free tier to collect small and recent data sets, or to get creative with some complicated web scraping methods. The academic tier gives access to Twitter's full archive search, exposing the full wealth of Twitter data back to 2006. This fantastic resource saves us learning those usual workarounds, but your project must be non-commercial, and you'll have to justify your need for the data and potentially even prove that you're an academic through Twitter's review process.

Playing with the Twitter API directly requires quite a strong programming background. Through the tiered API, we can access all those tweets: if we can prove our academic status, we can access 10 million tweets a month from all periods of Twitter's history, rather than the sort of 100 at a time we were getting through that Pipedream tool. But we also need to understand those API keys a little and manage them. So now I'm going to walk through a demo using twarc to access some of that Twitter data, which is right here. Okay. So yeah, this notebook, again, it's in the links; you should have access to it. Yeah, I can send the sentiment code. Sorry, I've just seen your message, Hannah.
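(Before we get into the notebook, for anyone curious what "playing with the API directly" actually looks like, here's a minimal sketch of a raw call to the v2 recent search endpoint using Python's requests library, assuming you have a bearer token from the developer portal. twarc wraps this kind of call for us, along with pagination and rate-limit handling, which is why we're using it instead.)

```python
# Sketch: one raw call to Twitter's v2 recent search endpoint.
# BEARER_TOKEN is a placeholder for a token from the developer portal.
import requests

BEARER_TOKEN = "..."

resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={"query": "vegan -is:retweet lang:en", "max_results": 10},
)
resp.raise_for_status()

# Print the id and text of each returned tweet (if any matched).
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"])
```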
So now I'm going to walk through a demo using twarc to access some of that Twitter data, which is right here. Okay. So this notebook, again, is in the links; you should have access to it. (Yes, I can send the sentiment code. Sorry, I've just seen your message, Hannah.) This demo will basically go through how to explore all of that data, and there's actually another demo in the same GitHub repo that shows how to do that sentiment analysis in Python and pandas, if that's more your thing. But yeah, we'll go through this. So, twarc and Twitter search: this targets the new v2 API, which was only released in August 2020. It's relatively new and not yet supported by a lot of the Twitter packages, but twarc does support it. There are also a few plugins we can use, including one that converts the output into a CSV file, and various other helpers we can install. It's a bit unusual in that it runs as a command-line tool, so you will need to understand a little of the command line or Python to use it, but this notebook will help you with that. And if you're not a command-liney person, I'd suggest installing something called Anaconda, which is the environment I'm using here; it hides a lot of the package management away from you, and you should then be able to use this notebook to do most of what we're doing now. So, installation. As I said, Anaconda. You're going to need Python, which you can check is installed by typing python; you should be able to run some little maths like 3 plus 3. You should also have pip installed, which you can check by typing pip or pip3, which will print its usage information. But again, if you've managed to get Anaconda installed, you won't need all of that. There are also details here on how to install twarc, which you can do with pip; or, if you're using the notebook, you can just run this cell and it will install twarc for you. At this point we have twarc, but what we actually need is twarc2: twarc supports the v1 API, which doesn't have the academic tier, while twarc2 supports the new v2 API, which gives us access to historic tweets. There's a whole page there on installing and upgrading to twarc2 if you need it.
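For reference, the setup cells amount to something like this sketch (the ! prefix is how Jupyter runs shell commands from a notebook; drop it if you're typing these into a plain terminal):

```python
!python --version              # check Python is installed
!pip --version                 # check pip is installed
!pip install --upgrade twarc   # twarc v2 and later provide the twarc2 command
!pip install twarc-csv         # the plugin for converting output to CSV
```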
Next we have the annoying bit: we need to deal with all that API key and authentication stuff that I can't stand, but I'll show you how to get through it quickly. If you get approval as a Twitter developer, you'll have access to the standard tier here; and if you can prove you're an academic, you'll have a tab for academic research as well. What you do here is create an app. I'll just call it Twitter demo, with some random numbers after it. This gives you your Twitter API keys. You could try to steal these now, but I would be very annoyed at you if you did, and I'm going to delete this app as soon as I finish the talk, so don't bother copying them down. But this is what you need to actually interact with the Twitter API, and we don't really need to worry about what each of them is. Now that we've got them, we can run twarc2 configure, which launches a little helper that asks for those details. The first thing it asks for is the bearer token, which we can see here. Next it asks whether we want to add API keys, which we do, so we type y for yes; it asks for the API key, which is this first one, and the API secret, which is the second. Then it asks how we want twarc to obtain our user keys: if we choose option one, it gives us a Twitter link we can visit to authenticate, instead of doing anything more complicated. That asks whether we want to authorise our app to use our data, which we do, and it gives us a little PIN to verify, which I paste back here. There we go: twarc is set up, and we can access the search from this point on. So, just as a test, I've got this little line here that I'll break down in a second. It should only take a moment, since it's only getting 100 tweets, and it's already done. If I look in my folder now, you can see I've got this 100vegan.jsonl file. It's in a slightly difficult-to-read format, but, just trying to find something... there we go: we can see a user's description, and they've even put vegan in it; and it's picked out as an entity too, with vegan and veganism in the tweet. So it looks like we've pulled some vegan data. To break the command down: twarc2 says we're calling the v2 API, through which we can potentially get academic-tier data. search means hit the search endpoint, which is what happens when you go on Twitter and type anything into that search bar up there. The limit says I only want 100 tweets; if we don't say this, it'll try to give us all matching tweets from forever, and we could eat through our entire monthly allowance with a single query, though you'd have to leave it running for days and days to do so. Then vegan is my search term, so we're looking for tweets that contain the word vegan, or users with vegan in their username. And 100vegan.jsonl is the output file. As you saw, we can get 10 million tweets a month on the academic tier, which sounds like a lot but also isn't that much: there are about 500 million tweets made every single day, billions over a year, so you have to be quite succinct about what you ask for. My next questions would be: did we hit the archive, and have we actually got tweets about veganism? To answer those, we need to open the data. I think I'm going to skim through some of this now, as time is getting tight; but again, the notebook is available, and you can always email or tweet me with questions. What I'm doing here is loading in that dataset and printing out the last five rows, so we can see some IDs, and we should be able to use those to check out the tweets themselves. If we hop over here, I've got "coming up with a whole-food, healthier option for dessert", vegan weight loss, all sorts of stuff going on, so it's clearly a vegan tweet; but it is from today. What Twitter has done is try to help us: it assumes we want the most recent, most popular content, which isn't necessarily the case here. I'd prefer archive data. There's a whole section in the notebook on how to verify tweets in that way. Next, I'm going to convert to CSV, because that JSON object isn't easy to work with. There's a plugin that does this for us, which I've already got installed, and it's easy to use: we run twarc2 csv, then the name of our input file, in this case 100vegan.jsonl, and then where we want the output to go, 100vegan.csv. And that's the conversion done. You'll see there are more than 100 tweets in there, but that's because some of them are referenced tweets, meaning replies, retweets, and things like that.
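Put together, that test collection from this part of the demo looks roughly like the following notebook cell (again, drop the ! prefix if you'd rather run the twarc2 lines in a terminal):

```python
!twarc2 search --limit 100 vegan 100vegan.jsonl   # 100 recent "vegan" tweets
!twarc2 csv 100vegan.jsonl 100vegan.csv           # flatten the JSONL into a CSV

import pandas as pd

df = pd.read_csv("100vegan.csv")
df.tail(5)   # the last five rows, tweet IDs and all
```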
And if we look in the folder again, you should see we have the CSV as well, which makes the data a bit easier to analyse in a lot of tools; plenty of tools aren't built to open vegan, oh my gosh, JSON data is what I meant to say. In there we can see all of our tweet text, the users, the created-at dates. It's simply easier to open. So now I can load that up, and it's a much easier object to explore in pandas. I always call the info function in pandas, which details how much of the data is missing, what type each column is, and things like that; and with the describe function you can see some basic summary statistics. We don't really need this terminal much any more. From a previous attempt, I got about 1.2 million tweets just looking for vegan tweets in the UK in January 2019, and people probably tweet more now than ever. That took multiple days to run and resulted in about a five-gigabyte file on my computer. So again, the scope of your project needs to be quite small: you need a very focused research question, to know what you're going to look at, and probably somewhere to store the tweets if you're collecting a lot. Okay, so a quick case study I did: how did the perception of vegans change over Veganuary 2019? This was the first largely adopted Veganuary. Famously, Greggs announced their vegan sausage roll, Piers Morgan made fun of them for it, and people showed up outside Greggs to protest. There was a lot of talk about sausage rolls at the time, which is also quite a useful way of verifying that tweets were coming from the UK. So, to build up that query: first, I'd use twarc2 search vegan vegan.jsonl. That is, use the v2 API, hit the search endpoint, search for the word vegan, and output to a vegan.jsonl file. If we run that without a limit, it will just keep going, grabbing thousands of tweets every couple of minutes, so don't run it without the limit flag. Next, we don't really want to look only for "vegan": we'd probably want vegan, veganism, vegetarian, plant-based, all sorts. That's not entirely trivial, but there's a documentation page on how to build up those terms; what I've got here is wrapped in brackets, (vegan OR vegetarian OR plant-based), to catch all of those. Then we want something to limit the search: adding the limit flag with 100 means I only want that many tweets. You basically want to sample the dataset you need; it's very unlikely you'll be able to get every tweet on the topic you're interested in, and even then you'd have to do some filtering, you'll hit rate limits, and all sorts of other stuff. Then there's the archive flag. We add this to tell Twitter we want old tweets via that academic tier; it forces the full-archive search rather than just collecting recent tweets and paging backwards. We can also add a start time and an end time, in a year-month-day format. And then we can add a location search. If we add the place term, that uses the location from a user's profile. Users can self-define their location, so I've got "Manchester, England" here, and that's what place refers to. Sometimes people put pronouns there, sometimes people put memes there, so it can be a bit unreliable. If we search for London, we'll only get people who say they live in London, and that can be out of date, right? It could be somebody who moved away from London five years ago and just never updated their location on Twitter. So it's something to mention in your work. The other option is a point radius, which only returns tweets from an actual location: even if a user's profile says Manchester, if they're in London when they tweet, that will show up. But this only returns tweets with tagged geolocations, which is about 2% of all tweets at the moment. You can use a tool to find the longitude and latitude you need. Putting all of that together, the full built-up query can look something like this.
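Here's a sketch of that full built-up query as a notebook cell. The dates and output filename are illustrative (a January 2019 window to match the Veganuary case study), and the archive flag only works if your keys carry academic access:

```python
!twarc2 search --archive --limit 1000 --start-time 2019-01-01 --end-time 2019-02-01 "(vegan OR vegetarian OR plant-based) place:London" vegan1000london.jsonl

# The geotagged alternative: swap place: for a point_radius operator, e.g.
# "(vegan OR vegetarian OR plant-based) point_radius:[-0.1278 51.5074 25mi]"
# (longitude first, then latitude, then a radius of up to 25 miles).
```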
So that's pretty much everything there: a full built-up query. I know I whizzed through that; apologies. If I run all of that, it will get 1,000 tweets, and again, this is all in the GitHub repo, written in more of a report structure than a presentation one, so you can look through it. It takes a little more time, but not a lot; it's probably slowing itself down so the Twitter API doesn't block us for hitting it too hard. So there we go: we've now got that new file, vegan1000london.jsonl, and we just convert it using the twarc2 CSV plugin again, and that gives us a dataset we can actually use. Again, very quick; it's not there yet, and now it is. So that's our 1,000 tweets about veganism, loaded far quicker than the Pipedream approach. Obviously 1,000 is a very low bar; I could have asked for 10 million, but we'd have sat here a long time looking at very little. Okay, so that's the twarc demo; again, it's all in the notes. Let's see how we're doing for time. Okay, I think we're all right; we'll have time for a couple of minutes of questions. We won't actually do any exploration of that data now, but there's another notebook in the repo, called tidying tweets or something like that, which has my one-hour exploration of the dataset. And with all of this, you should be able to decide which of these tools is right for you. Twitter's built-in analytics lets you explore your own tweets, but it's mainly aimed at businesses, so it should really only serve as inspiration. twArxiv gives us some slightly more advanced searches, but again it's only really useful for inspiration, or for basic analytics of your own data or other data from the Open Humans datasets. Pipedream is the first tool I'd suggest as a solid foundation for academic research, though collection is quite slow, it takes a long time to build up a dataset you can actually analyse, and if you make mistakes, you can't undo them or rerun your code with the mistakes fixed. The Twitter API itself you can play with directly, but that requires a lot of programming, so I'd suggest using a helper package; and since a lot of helper packages don't support the academic tier, that's why I recommend twarc. twarc2 does support the v2 academic tier, doesn't really require programming, but will require some command-line skills. So it's not necessarily trivial, but it's something you can definitely start with and find help with.