Hello everyone, welcome to this UK Data Service webinar on the Twitter timeline. Your presenter today will be Peter Smyth of the University of Manchester. What we're going to be looking at today is the Twitter timeline, the various aspects of it, and the things we need to set up before we can actually get any data. Basically, this is the line-up. We're going to look at keys and authentication. I think it goes without saying that in order to use the Twitter API you need a Twitter account as well, but the keys and authentication are in addition to your Twitter account. I'm going to mention the new API and the developer account, or rather the academic account. I should say that I'm going to try to do this in an entirely non-technical way, in that we won't actually be seeing the code, but we will be seeing the results of having run the code. So non-technical people should be able to follow along quite happily. But things like 'API' I can't really avoid mentioning at some point. We need to decide what is and what isn't in a user's timeline. What we can actually get and what a user would actually see on the screen are not quite the same things, and that's a bit unfortunate, because there are lots of things on the screen which we can't actually get from the API. But we can do a better job than simply getting an individual user's timeline. So we'll go through the process of extracting a complete timeline. Now, when I use the word timeline, I tend to use it in two different senses. One is an individual's timeline, i.e. what they would see if they were looking at their Twitter account. But 'timeline' is also used in the API to mean the tweets of an individual user, or in fact the 3,200 most recent tweets of that user. And that's only telling you part of the story, because it misses out things which they see on their screen.
My idea of a complete timeline is to include all of the tweets of the user, plus all of the tweets that they see from the other users they are following. So that is what I mean by extracting a complete timeline, while avoiding the API rate limits. Well, again, we're not really going to discuss the API rate limits. They exist, but most of the code that you use will have ways of getting around them, or waiting for them, and what have you. So it's not really an issue we need to concern ourselves with today. If you need to write code to get timelines, i.e. a lot of tweets in as short a period of time as possible, there are lots of code examples of how you manage the API rate limits which are imposed by Twitter. Basically, all you do is wait until a request fails. It tells you that you have failed because you've reached a rate limit, and it will actually tell you at what time you can commence asking for tweets again. So it's a very simple process of avoiding the rate limits: you don't really avoid them, you just run into them, wait, and then run into the next one. Having collected all of the data that we need, the next thing we're going to need to do is actually stitch things together, because we're going to have to collect the data in various little parts. Even your own timeline comes in parts that we're going to have to put together in order to make anything usable at the end. I'll mention conversations rather briefly towards the end because, again, it's something which is made a bit better with the new API, but it still leaves a lot of work for you to do in actually putting things together, should you need to do so. In some use-case scenarios you may not need to bother with that, but in others it will be quite important. And although you get a bit of help with the new API, you're still going to end up having to stick things together. I'll give an example of that later on.
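The run-into-it-and-wait approach just described can be sketched in a few lines of Python. This is a minimal illustration rather than any particular library's implementation: it assumes a rate-limited (HTTP 429) response carrying Twitter's `x-rate-limit-reset` header, which is a Unix timestamp saying when you can commence asking for tweets again.

```python
import time

def seconds_to_wait(headers, now=None):
    """Given the headers of a rate-limited (HTTP 429) response, return
    how long to sleep: x-rate-limit-reset is a Unix timestamp telling
    us when requests may resume. Falls back to 60 s if absent."""
    now = time.time() if now is None else now
    reset = int(headers.get("x-rate-limit-reset", now + 60))
    return max(reset - now, 1)

# A response saying the window resets 900 seconds from "now":
print(seconds_to_wait({"x-rate-limit-reset": "1000900"}, now=1000000))  # 900
```

In a collection loop you would call this whenever a request comes back with status 429, sleep for that many seconds, and retry, which is exactly the wait-and-carry-on process described above.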
And then at the end, having got it all together, well, what are you going to do with it? Because quite frankly, when you've constructed this timeline of what the user sees, it's really just a very long list of records, a very large dataset. I'll show you the figures later on, but if I ask for just the timeline of the UK Data Service, I get 3,200 tweets, so that's 3,200 rows. But the Data Service follows over 3,000 other accounts, so if I had the complete timelines of all of them, I'd end up with well over a million rows of data. And that's not something that you want to look at on the screen. So at the back of your mind, before you start any of this, you really have to have some kind of idea of what it is you're going to use this data for, and I'll give you a few small examples at the end. Keys and authentication have been simplified with the latest API. API version 2 started, I think, about January last year, but API version 1.1 is still commonly used. Version 2 has simplified things slightly in that it uses a thing called a bearer token for access. That makes it a bit easier, but you still have to get one before you start, and before you can even get that far, you still need a Twitter account. So if you think you're going to be getting Twitter timelines, make sure you've got a Twitter account. Any request you make to the API has to include, in the case of version 2, the bearer token. As I've just mentioned, currently version 1.1 and version 2 are both running together. I don't know how old 1.1 is; it's quite a few years old now. Version 2 is still being developed and is simpler to use. They've got quite a good little website telling you how to use it, with code examples and what have you, and I noticed in the last couple of weeks that, in addition to examples in Ruby and Python, they've now got some R examples as well.
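As a rough sketch of what 'including the bearer token in any request' means in practice, here is a minimal version 2 call using only the Python standard library. The endpoint shown is the real v2 username lookup; the token is of course a placeholder you would replace with your own from the developer dashboard.

```python
import json
import urllib.request

def auth_headers(bearer_token):
    """Version 2 authentication: every request carries the bearer
    token in the Authorization header."""
    return {"Authorization": f"Bearer {bearer_token}"}

def api_get(url, bearer_token):
    """Make a GET call against the v2 API and decode the JSON reply."""
    req = urllib.request.Request(url, headers=auth_headers(bearer_token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. look up the UK Data Service account (needs a real token):
# api_get("https://api.twitter.com/2/users/by/username/UKDataService",
#         "YOUR-BEARER-TOKEN")
```

The same pattern works in R or any other language: the only authentication step for v2 is putting that one header on the request.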
So depending on what your programming language is, you should be able to find some code to help you out there. I won't go into the differences between the versions; it's not really relevant because, as I said, I'm not going to show you any code particularly, but all of the results that I'm going to show you have come from using version 2 of the API, so it doesn't matter for today. The old system is still around and you can still use it: if you have old keys and legacy code, that's still going to work perfectly fine. I haven't seen anything saying when it's going to be shut down; I would imagine that it will run for a couple of years more yet. The downside of the new version at the moment is that some of the programming-language packages haven't been updated for it yet. Some have, some haven't, but you can always fall back on using basic API calls from virtually any language you like. What do you get in the new system? You get a nice little dashboard which tells you how many tweets you've downloaded on a monthly basis. I guess part of the reason for that is because your account is limited to either 500,000 tweets per month or, if you have an academic account, which I would recommend, up to 10 million tweets per month. That might seem to be a very large number until you start downloading people's timelines, because that involves an awful lot of tweets. This next point is only really relevant if you've used version 1.1: some of the endpoints have changed, and there are changes to the way the data is delivered. The main difference in the way the data is delivered is that in version 1.1, if you asked for a tweet, it told you everything about that tweet. Anything it had on that tweet was sent down the line to you. Whereas in version 2, you have to say what data you want returned. By default, if you ask for a tweet, all you get is the tweet ID, the created time, and the text of the tweet.
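Because version 2 only returns those defaults unless you ask, the request itself has to spell out the extra fields. This is an illustrative query string for the v2 user-tweets endpoint; the user ID here is a made-up placeholder, and the parameter names are the standard `tweet.fields` and `expansions` ones.

```python
from urllib.parse import urlencode

# Everything beyond the default id/text has to be requested explicitly.
params = {
    "max_results": 100,
    "tweet.fields": "created_at,author_id,conversation_id,"
                    "public_metrics,entities,referenced_tweets",
    "expansions": "referenced_tweets.id,author_id",
}
# 1234567890 is a placeholder user ID, not a real account.
url = "https://api.twitter.com/2/users/1234567890/tweets?" + urlencode(params)
print(url)
```

If you forget to ask for a group of fields here, it simply won't be in the data you get back, which is why the planning-ahead point above matters.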
The text of the tweet itself is a bit misleading, because it's not necessarily going to be the full text of the tweet, but we'll go into that later on. From a position where you could just ask for the tweets and then decide what you wanted to extract from them, you're now going to have to plan ahead a bit more and decide what it is you want the tweet to return, because you can certainly get into a situation where you haven't got all of the information that you're potentially going to need. Just as a little example, this is taken from a version 1.1 call to the API, returning a tweet. This is from a guy called Donald J. Trump. It's not really about Donald J. Trump; this is about all the different fields that you get coming back. A lot of these fields, perhaps half of them, are quite useful and almost essential. You'll almost certainly want the created_at; that's still available. The ID string is the unique identifier for the tweet, which you're going to use, not in its own right, but should you ever need to go back and get this tweet again, that is how you would reference it. You also get this thing called full_text; this has changed slightly in the new system, but here you seem to be getting the text of the tweet. Now, as this is a short tweet, you do in fact have the full tweet text there. But there are situations where it'll say full_text and, in fact, you haven't got the full text. If this were a retweet, it would start off with 'RT', followed by whoever the original tweet came from, and then it would have the text of the tweet. If you think about it, if someone has used their limit of 280 characters in a tweet, and then someone retweets it, the retweet information goes on the front and some characters are going to get dropped off the end. We'll worry about how to deal with that later on. There are other things which are quite useful. The entities can be quite useful.
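Going back to that retweet truncation for a moment, the character-dropping effect can be seen with a little arithmetic. This is purely illustrative: it assumes the text field is capped at 280 characters and that the 'RT @user:' prefix is simply prepended, which is roughly how the legacy text field behaved.

```python
original = "x" * 280                      # a tweet right at the 280-character limit
prefix = "RT @UKDataService: "            # prepended when someone retweets it
retweet_text = (prefix + original)[:280]  # the text field is capped again
dropped = len(prefix + original) - len(retweet_text)
print(dropped)   # 19 characters lost off the end of the original
```

So the longer the retweeter's screen name, the more of a full-length original gets cut off, which is why the retweet's own text field can't be trusted to hold the full text.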
I'll show you an example of using the hashtags later on, and the reply fields are useful when you've got to reconstruct conversations. I said that's been made a little bit easier nowadays, but previously that would have been essential stuff to have. But as we get down towards the bottom, you get some stuff where you think, really, why on earth would I want this? Profile text colour. Profile background colour. There's a whole host of things in any version 1.1 tweet for which you'd really struggle to find a use case. This is vastly improved in the latest version, because you ask for specific things. I don't think there's actually any way of asking for the profile background colour; that's almost disappeared completely. But all the important ones which I've been mentioning are typically available and easily accessible. I mentioned academic accounts. How they differ from a normal version 2 account is that your 500,000-tweet limit goes up to 10 million. But like I said, don't get carried away, because you can find yourself filling that up. Perhaps a better perk is the fact that you get access to the full search capability. In the previous version, if you wanted someone's timeline, you got the 3,200 most recent tweets. If you wanted to search for a term, or for tweets from a specific person, you could do that, but it would only search back seven days, which is clearly potentially a problem. Now, if you've got an academic account, you can actually use the full search. Given that you can couple that with start times and end times, you can effectively extract tweets going back to 2006, or whenever tweets started. A sort of downside, but expected, I suppose, is that you need to apply and be accepted for an academic account. How do you go about doing that? I keep mentioning it: yes, you do need a Twitter account before you start. There's a relatively short online form, three or four pages I think, and then you have to wait for a day or so and, I would hope, you get one.
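For academic accounts, the full search just described lives at the v2 `/2/tweets/search/all` endpoint, which accepts `start_time` and `end_time` parameters. Here is a sketch of building such a request; the query and the dates are just example values.

```python
from urllib.parse import urlencode

# Full-archive search: academic access only. start_time and end_time
# are RFC 3339 timestamps, so you can window the search anywhere
# back to 2006.
params = {
    "query": "from:UKDataService",
    "start_time": "2015-01-01T00:00:00Z",
    "end_time": "2016-01-01T00:00:00Z",
    "max_results": 100,
}
url = "https://api.twitter.com/2/tweets/search/all?" + urlencode(params)
print(url)
```

This is what removes the old seven-day restriction: tweets that have long since been pushed off the end of a timeline can still be collected retrospectively by windowing the search.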
I don't know what the acceptance criteria are, but you're asked about yourself, your own academic credentials, and your project. The idea is that you have some project already confirmed, or in mind and going through approval, something like that, and it's going to need social media data; Twitter data in particular will be very, very good. Effectively, you've got to make the case for why your project would be better with the Twitter data, but that's not really very hard to do. As part of the dashboard, in addition to telling you how many tweets you've got left in a month, there's a little sidebar which gives you links to tutorials and things like that, and one of them is a forum for the academic account. There are some messages in there asking 'why have I been refused an account?', so it's not impossible to get refused. But I've got to admit, when I said wait a day: I only had to wait a day, and I got my account. The application process is an online page; I was going to show you this, but I'm not going to bother, as I've already described it. You go to the 'I want an account' page, and it says, oh, are you an academic? You say yes. Would you like to apply for an academic account? Yes, I would, and it just leads you through these three or four pages that you fill in and send off. Okay, so let's assume now that we have our account and we want to look at timelines. As I said at the beginning, you've got to distinguish between what the API says is a timeline for a user, i.e. 3,200 tweets, and what we think of as a timeline: a chronology of what a user sees on the screen. So there are four things we need to consider. What is included in a timeline? What is not included in the timeline (here I'm talking about the timeline that we can get hold of)? What is not included but you can get, so we can add it to the basic timeline? And what is included but you cannot get? If you just bear with me a minute, I'm going to use this technology to pause the sharing and show you a Twitter account.
I've just opened up my own Twitter account, the one I needed for the keys, and what you can see is what anyone would see if they opened their own Twitter account. So what I've got here is a retweet from the UK Data Service, because I follow the UK Data Service. We've got 'Experts in Money', which is some kind of advert, nothing to do with me; I don't know why I'm particularly seeing it, that's down to Twitter. I've got some more UK Data Service stuff. We've got some ONS stuff, because I follow the ONS as well. I don't follow very many people. What Twitter will also offer you is who you might want to follow. So this is Twitter's idea of who you might be interested in: the UK Statistics Authority, because I'm already following the ONS, and so on. Another advert, from Vodafone, and so on and so forth. I can go down here as long as I want to. Now, the point about this is that I will see any of the tweets from someone I'm following, including the ones they have retweeted. If there's a tweet from the UK Data Service itself, which I think is all down here, I get that as well. The ONS ones I get. And of course, should I actually tweet something, which I virtually never do, that would appear in here as well. So the easiest things to get, or the things that you can get, are the tweets from your own timeline, i.e. the things that you have tweeted yourself. You can also see the tweets from those you're following, from their timelines. You won't actually see everything that the UK Data Service sees; you only see the things they have retweeted, because a retweet essentially becomes one of their tweets, and I'm following them, so I see their tweets. But take the CLOSER account: I don't follow that, so I'm only seeing this tweet from CLOSER because the Data Service has retweeted it. But it means that the Data Service is almost certainly following CLOSER. So, as we go on, when I get the list of friends, I can get the entire timeline for CLOSER and work out what the Data Service would have seen.
The other things, which obviously we're not going to be able to get at all, are things like the adverts. Again, no doubt they've got some clever algorithm which decides that they think I'd be interested in 'Experts in Money'. And also the 'who to follow' items; again, this is internal to Twitter. These aren't tweets, so we can't get them. Adverts we can't get, and 'who to follow' items we can't get. We can only get the things which actually correspond back to an actual tweet, whether it's an original, or something that has been retweeted and you've seen it because you're following the retweeter. So, what we're interested in is what isn't included but you can get, i.e. the other tweets from your friends which weren't retweeted, and what is not included and you can't get, that is, the adverts and the recommendations from Twitter. That's a bit of a pain, because they'd be quite useful. That's what you do see when you're looking at your own timeline. What you can't get are the tweets that your friends see, and their recommendations and ads, which again may be potentially interesting to you but can't be got. What you can get: all of your own tweets, that's kind of obvious I think, but you can also get all of your friends' tweets as well, by extracting their timelines. What you can't get are deleted tweets, anything from banned accounts, the adverts, and the recommendations; that's all been covered now. Let's have an example of a Twitter feed. The example we're going to use is the Data Service's, which in fact we've just looked at.
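Before we look at the Data Service example, it may help to see roughly how a timeline is actually pulled down in code. This is a hedged sketch rather than production code: the pagination logic (keep following `meta.next_token` until it runs out, which is how the v2 user-tweets endpoint hands back its batches) is separated from the network call so that it can be demonstrated on its own with fake pages.

```python
def paginate(fetch_page):
    """Collect a whole timeline from a paged v2 response: keep
    following meta.next_token until it disappears (the API hands
    tweets back in batches, newest first)."""
    tweets, token = [], None
    while True:
        page = fetch_page(token)
        tweets.extend(page.get("data", []))
        token = page.get("meta", {}).get("next_token")
        if token is None:
            return tweets

# With the real API, fetch_page would GET
#   https://api.twitter.com/2/users/<id>/tweets?max_results=100
# adding pagination_token=<token> on every call after the first.
# Here we fake two pages just to show the mechanics:
pages = iter([
    {"data": [{"id": "3"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    {"data": [{"id": "1"}], "meta": {}},
])
print(paginate(lambda token: next(pages)))
```

The same loop, pointed at a real `fetch_page`, is essentially all a 'give me this user's timeline' call amounts to; libraries like Tweepy wrap exactly this for you.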
I've mentioned software before; we're not really going to be talking about software, but I've mentioned, for example, Tweepy, which is very popular with Python users; as far as I know, that hasn't been upgraded yet. This other one, twitteR, is very popular in R. The Python package twarc I've actually never heard of, but I think that has been updated. And if all of that has failed, you can always use the raw API, using Python requests or the equivalent in R, and I'm sure there are plenty more possibilities out there as well. We're going to look at the UKDS timeline; why not, because it's publicly available. From the basic stats I've already extracted, I can tell you that the timeline consists of 3,248 tweets, or it did on Sunday, I think, when I downloaded it. If you look at the Twitter documentation, it says 3,200 tweets, but the reality is you tend to get a few more; I assume it comes in little batches, and you get the batch which runs over 3,200. What you can tell immediately, if you look at the tweets, is that they come back in reverse chronological order, i.e. the most recent first. So if I go down to the last tweet that came back, I can see that it was from the 5th of June 2018. The formats of the times are pretty universal, standard formats, and knowing that and how many tweets there are, I can work out by simple arithmetic that they're doing an average of 3.1 tweets per day. Now, you may think the tweet rate itself is a significant statistic; I doubt that, but there are situations where tweet rates can actually cause you a bit of a problem when you're doing this. Not so much now, but certainly with the earlier version of the API it was a problem. Just a bit more about what I extracted directly from the timeline: in fact, this information here comes from the user information about UKDS, which again is publicly available, and although it's not actually part of the timeline, I would recommend that if you are getting a timeline for all of the friends of the Data Service, it's worthwhile
getting their user information as well, not least because it will give you a description of that user, if they have filled it in. It will also tell you the same information I've taken here for the Data Service: how many original tweets they've made, retweets, quoted tweets, and replies. Those are the four categories of tweets. Now, most tweets are retweets and original tweets, and the split between the retweets and the original tweets will depend largely on the type of user. I mentioned before that I don't think I've ever tweeted, or have very rarely tweeted, but then equally I never retweet either. Individual users are far more likely to offer original tweets, and might occasionally retweet something if they find it particularly interesting. If you take an organisation like the Data Service, or public-service types of organisation, the chances are that, yes, they will have a big set of original tweets, for example telling you about webinars and all this sort of thing in the case of the Data Service, but a lot of what they do will be retweets. So finding that they've got this ratio of about three or four to one is perhaps not unusual, because part of what the Data Service is doing is providing a service to the people who are following it. They have 10,000-odd followers, they follow people like the ONS and lots of other data services and what have you, and they will selectively retweet some of those to provide information to their roughly 10,000 followers, so that those followers don't have to follow the individual accounts themselves. That's the idea; it's a service that they're providing. So having a lot more retweets than original tweets isn't really a sign of a lack of original stuff to say; it's them actually providing the service that they do. Now, the significance of the tweet rate: the higher the tweet rate, and this is common sense I suppose, the shorter the time span covered by the timeline. If you're limited to the 3,200 most recent tweets, the faster the tweet
rate, the less time you'll actually have covered in the timeline, and for things like the BBC and other news agencies it could be as short as a couple of weeks. The consequence of this, at version 1.1, used to be that you had to collect tweets promptly, because if they fell off the end, got pushed off the end by new tweets, you couldn't really get them. Even doing individual searches wouldn't help you at 1.1, because you could only go back seven days. So the timing of the collection of the tweets could be quite important. Not so much now, because if you've got an academic account, you've got access to the full search API and you can go back as far as you need to, so you can collect them retrospectively if you need to on the new system. A specific example: the DICE project, which is a project I work for, wanted to get the complete timelines of 905 survey respondents who had willingly given us their Twitter IDs. When you look at the 905 complete timelines, that gives me a total of 300,030 unique friends' timelines; there were in fact a total of 450,000, but of course some of them have the same friends, and there's no point in downloading the same timeline twice. But even at 300,000, it took me over 16 days to collect them all. So, going by that previous slide, if the last one to be collected was a news agency cycling over every 14 days, I wouldn't have the complete timeline; it would have been completely overwritten between my starting to collect them and my finishing. And there's not much you can do about that, other than having multiple people trying to collect them at the same time. Like I say, it's not so much of a problem now with a version 2 academic account, because you can specify specific time periods for a given account. Right, collecting the data. What we need to collect is the UKDS timeline itself, which we can get directly; we need to get a list of the UK
DS's friends, which again you can do directly using the API, and then we need to get the timelines of all of the UKDS friends. As I mentioned before, although it's not actually part of the timeline, it is useful to have the user details of the Data Service and of the friends. I've got an example of what you actually get for the friends, which I'll share with you in a bit. OK, so hopefully you can now see my little UKDS friends spreadsheet. Basically, this is just a flattened version of what comes back when you say 'give me all of the friends of the Data Service'. If I go down to the end, oops, down to the end, you can see there are 3,356 of them. Going back to the beginning, you get an idea of what information you get. Right at the beginning here I've got the IDs; these are the Twitter IDs of the friends, and I'm going to need that list of IDs in order to get the timelines, because you ask for a timeline by the ID. You get the description; now, this is user-provided information, so it's whatever they put in there. Most of these are filled in, but you don't have to put one in, and some people don't; notice this one here, 'currently off Twitter'. Then there are lots of other little bits of information: you get the followers count, you get the following count, and you get the tweet count. The tweet count can be used to work out the tweet rate. I've done this column in yellow because it isn't provided; it's something I've added to this table, but you can work out the tweet rate. So you can see up here, I'm pretty sure this is the BBC News user, and they've got a very high tweet rate: 86 tweets per day on average. There are other little bits of information you may or may not want to use, and probably nothing much else of note in there, but some of the basic stuff like the counts and the descriptions is very useful to have, and the IDs are essential, because they're the basis of getting the timelines. This is just an example for part of the Data Service's own record, of course: user information, which wasn't in the list of friends,
but we want this as well; it's exactly the same kind of thing, and this is roughly how it comes back from the API. You can see here I've got the followers count and the following count, which is the friends, and various other little bits of information. I used the 10,535 to work out the tweet rate. Oh, just a little more about Trump, as we happened to come across one of his tweets before. Okay, what do we know about this guy called Trump? A controversial tweeter. I've always wondered why we refer to tweeters and not twitterers, but never mind. Controversial tweets, yes, but that's just free speech, that's okay. Then some of his tweets started getting warnings on them, and that's really the opinion of the fact checkers, or of Twitter themselves. But what's possibly more important about this is that we've got no way, when we collect the tweets, of getting that information. If you were looking at a timeline, if you followed Trump, when those tweets came up with warnings on them, you would see the tweet and you'd see the warning. But if you're using the API, all you get is the pure tweet, not the bit which is overlaid on top of it. If a tweet is removed, then you can't see it and you can't download it from the API. Even if you had the tweet and knew what the tweet ID was, it wouldn't help you; it wouldn't come back in a search. It doesn't come back and say, oh no, that guy is banned; it just returns nothing. And when he actually got banned, he became a non-person to Twitter, so you can't download any of his tweets, even the ones which were only mildly controversial. He's a non-person to Twitter. And this actually has another effect which you may not immediately think is obvious: if he's a non-person, yes, it's obvious that you can't get his tweets, but his ID is also removed from the friends lists of his 85 million followers. What that means, in terms of the project I'm working on, is that for our 905
volunteers, if you like, we had their Twitter timelines and we had their lists of friends. If any of those were following Trump, then given that we didn't get this list of 905 until mid-January, after Trump had been banned, his ID had been removed from their lists of friends. So when I first got these lists of friends and said, oh, I wonder how many people are following Trump, I found that none of them were, because they'd all been removed. That meant we had to get the list of 88 million followers and then try to put him back into the friends lists of our 905 where it was relevant. So that's a bit of a pain, and it's certainly not something I think any of us saw coming; we were all expecting Trump to get banned in the end, but we didn't realise that this would be a consequence of it, his ID being removed from the friends lists. Obviously, if you downloaded Trump's tweets before he was banned, you still have a copy of them. This brings us back to what data we want to keep. As I've mentioned, in version 2 of the API this is a decision that you effectively have to make in advance, because apart from a couple of defaults, you have to ask for anything that you want. It's not individual fields, they're groups of fields, but you still have to ask for them; whereas, as I said, in 1.1 you used to get everything, and it was really a case of deciding what you wanted to get rid of after the fact, if you like. But in both cases there are some key fields which you're almost certainly going to want to keep. So here's a little table of some of the fields; I think these are in fact all of the ones I collected from the version 2 API. I could probably have had more if I'd wanted them, but I didn't. Again, we've got the usual suspects: the text, yes; the author ID, the account that generated it; the ID itself; created_at; and the conversation ID. That's new in version 2, and as I say, we can use it to re-collect all of the tweets related to a specific conversation. A conversation would cover things like a multi-part
tweet, so when you split a tweet into three parts or something like that, they'll all have the same conversation ID, and you can easily collect them all and put them back together yourself. The referenced tweets field tells me that the tweet itself refers to another tweet. Other things which you're likely to want are the entities, the mentions and hashtag entities, because they can be very useful for doing analysis of various types, and the public metrics. These are the ones which I showed you before, the retweet counts and so on: I showed you the equivalents of these for the users, and these are the values for this particular tweet that's come back. The entity URLs, again, can be useful; there's a little bit of a gotcha on those, which I'll cover later on. And the user ID, yes, again you might want to have that. So some are definitely essential, quite a few you'd want, but overall this is quite a manageable list compared with what you used to get, so I think I would tend to just take them all and ignore the ones I don't need to use. The next step is the timeline reconstruction. How are we going to do this? Step one, we get the target timeline, that's the UK Data Service; that's a single call saying 'I want the timeline for this user'. Step two, we get the list of the Data Service's friends; again, that is a single call. That's changed slightly between version 1 and version 2: you used to just get an ID for each of the friends, now you get a little bit more information, so it's a little bit more involved, but essentially you need the list of friends to get the IDs to go on to step three. Step three, we get the timelines of the friends, and that's essentially exactly the same as step one, except that we substitute each of the friends' IDs in turn for the Data Service's ID. The next thing we want to do, and I'm going to come back to this point in a minute, is combine the data and the includes sections of the timeline data. When the data comes back,
it's broken down into various sections. The majority of the information that you want is going to be in the data section, but the includes section is also needed, especially if you've got retweets, and as you don't know in advance whether something is going to be a retweet or not, you're always going to ask for the includes section to be provided. Having combined the data and the includes, the next thing we want to do is combine all of the timelines, by appending one to the bottom of another; and by all, I mean all of the friends' timelines plus the target timeline of the Data Service as well. Now, as I said, step three here is exactly the same as step one except that I've changed the ID, and the combining is going to be the same for everyone, so everything will combine quite happily together, because it's all in exactly the same format, because it all came back from the same type of call. Then you'll probably want to sort by the date or the ID to put them in chronological order. You can have them forwards or backwards, it doesn't really matter, and either the date or the ID will work, because the dates are unique and the IDs are unique; that's the tweet ID here. Going back to point four, combining the data and includes sections: if a tweet was a retweet, then you need the data in the includes section to get the full text of the tweet. I pointed out when we were looking at the tweet data that the thing it calls text, or full text, is what it thinks of as that particular tweet; but if it was a retweet, the real full text is the text of the original tweet, and the original tweet will be included in the includes section. You also get an indication of whether or not it was a retweet. So the idea is that, in order to collect the full text, you say to yourself: if it's a retweet, then I want the text taken from the includes section; if it wasn't a retweet, then the text of the tweet itself will be the full text. Okay, and that's exactly
the same very similar to what it was version 1.1 as version 2 the actual approach is slightly different but it's the same logic if it's a retweet then take it from don't trust what it says in the full text of the tweet go to the includes or wherever and it's the same argument if you're going to be dealing with the entities so these are the hashtags the mentions and URLs if it's a retweet you need to get the entities entities from the include section rather than the actual original tweet because these will have the full set of entities rather than what was available in the rather truncated tweet timeline itself it's just a chronological list of events I'll show you what the timeline looks like because it's incredibly boring so this spreadsheet is the reconstructed timeline of the data service going up to the 16th of the 4th in fact when I did the friends I didn't do the complete friends timeline I just did it for the last previous week but down here I mean all of these if I move down here you'll see the author ID X here the X is just because I've combined two data sets together don't really mean anything but you can see here the author ID it changes as I go up and down because I'm starting off with the data service it's probably easier if I show you the other end where I've got the actual names I think so you can see here I've got a user name for the data service and as I go down here oops you can see all of the friends I've collected tweets from yeah now all of these individual tweets would have been seen on the data services on screen timeline but only the ones that they retweet would their friends see on the timeline so this is the complete picture and from this we can decide which ones did they retweet and which ones did they not retweet that in itself might tell a whole story depending on what you're trying to do but this is the complete list now if you consider here I've got the complete data service timeline and about a week's worth of their 3,000 friends 
timelines, and you can see we're already up to nearly 6,000 rows here. So if I had taken all of the data from all of the timelines, it would be a very large data set. This is what I was going to show you: these are the friends' tweet counts, and if I sort the tweet count from smallest to largest, you can see that there are some people the Data Service is following who have never tweeted. This is actually quite common — people have accounts and never actually use them. And if I sort it the other way around, as I mentioned at the beginning, some of them, like the BBC, are very prolific in the number of tweets they produce. So even though you're collecting someone's timeline, there may in fact be nothing in it.

So we have a chronological sequence of events, and really the next step is up to you; it's going to depend on what you're trying to get out of this. I've just got a few little examples. Just before we move on: conversations and responses — I've mentioned these before. If you're going to do any kind of text analysis, you'll probably want to make sure that the conversations are recombined. Conversations don't just cover "I said, he said"; they can also cover multi-part tweets, which will help you put them together, because if you're doing text analysis I assume you would want to work on a complete tweet rather than three parts of the same tweet. The conversation endpoint does help you with this, but you still have to do some of the work of assembling it yourself.

Timeline analysis: what do you want to do, what data do you need to do it, and do you need additional software or data? The first part I think we've probably covered — you've got to know what you want to do and keep all the data to do it. In terms of the Twitter API, as I say, with version 2 you're probably better off just collecting everything and keeping it, because it's not nearly as bulky as version 1. Do you need additional software or data? Well, if you're doing timeline analysis, yes — I could do little graphs of changing tweet rate over time, things like that, but it seems more likely that you'd want to analyse the timeline against external events which are known to have happened. For example — I don't have the example here — if we were looking not so much at a person's timeline but at tweets on a given subject, like an election, and something happened in the news about that subject, you'd expect a spike in the number of tweets, and it's the same for this sort of thing. If you were just looking at someone's personal timeline, you might get a flutter of tweets coming and going around their birthday or Christmas. So the question is, could you work out their birthday? Well, everyone does Christmas, but could you deduce their birthday from the tweet rate over time? Of course you'd have the text as well, which makes it easier, but there are other cases where the text isn't going to help you and you're interested purely in how tweets change over time — for example, if one of the people you were following started tweeting a lot, and whether you retweet what they say or not. So changes in tweet rate, yes. Trending — I've sort of mentioned that: external hashtags are the example where you'd need additional information to know what was actually trending, to see if it's reflected in the timeline you're looking at. Retweeting new and different sources — I've just mentioned that one as well. Sentiment analysis — this is dealing with the actual text of the tweet. Personally I have a lot of difficulty with sentiment analysis on tweets, for the following reasons. One, you've got very limited text; sentiment analysis is tricky with limited text, and when you consider that within a maximum of 280 characters people are putting hashtags, mentions and URLs in there as well, you've got even less text to work
with. Then you've got emojis — how are they treated? And then we've got irony and sarcasm; I've yet to see anyone with a good system for dealing with irony and sarcasm when it comes to sentiment analysis.

A small demo — a quick pause while I bring up the demo. Okay, I said there would be small demos. In this spreadsheet I've got the text of the tweets, the Data Service's tweets, and what I'm using is a little Excel add-in — if I go to Insert, Add-in, I'm using the Azure Machine Learning add-in up here. If you haven't got it, you can just download it, it's free, and you can use it to do simple text analysis. Now, it's a bit flaky if I'm honest, so I've already run it in advance rather than risk it going wrong, but the idea is that you've got the tweet text in column one, you run the application, and it comes back with a sentiment — neutral, negative or positive — and a score. Basically, I think below 0.25 is negative, 0.25 to 0.75 is neutral, or thereabouts — well, obviously not quite that — and anything higher is positive. So it's calculating a score and then deciding whether the tweet is neutral, negative or positive. Now, the problem I have with this is that I just don't believe it, because if you read some of these — I'm not going to go through them — and ask yourself why is that negative, why is that positive, and when you get to things like this one down here, where half of it is mentions and URLs and what have you, you think, well, how can it possibly draw a conclusion? It is sitting on the fence in this particular case, but you get the idea: it can be very hard to do sentiment analysis with Twitter data, I think.

The next thing is that we can do basic statistical analysis on our timeline — for example, how many original tweets versus retweets. If you remember, at the beginning I told you that for the UK Data Service's own timeline I broke that out for you. The user information: how old the account is, the number of tweets, friends and followers — again, we've seen all of those, and the user description can be useful. Tweet rate over time — again, you can do that in a timeline analysis. We can look at the relationship between mentions and hashtags and known trends or events — I sort of mentioned that a minute ago as well. So I've got a couple of graphs here which I've done using the hashtags. Here I've taken the top 10 hashtags, and this is the UK Data Service only — this is before I added in their friends. Unsurprisingly, UK training and UKDS webinars are the largest hashtags, because of course they happen almost on a weekly basis, or multiple times a week. Below that, much smaller, we've got various other things: UK COVID data dive, love data week, identity in data and so on. The UKDS health 20 and health 19 — these are, I think, annual conferences they have, and we've got both 19 and 20 in there because, if you remember, the timeline I got for the Data Service goes back to 2019. Now, the point of showing this is really to compare it to the next one: these are the top 10 hashtags seen by the UK Data Service — that is, the people they are following, i.e.
their friends, and what they've got in their hashtags. We can do that only for the week of data which I collected for the friends, but we can do exactly the same process and see what they're all talking about. Just looking at these names down here — BBC papers, tomorrow's papers today, BBC football, BBC breakfast — you can see this is reflecting the fact that the BBC tweet an awful lot, and they're all being picked up by the Data Service; but if you're following the Data Service, you'll only see the ones that they choose to retweet. So the next step I could have done is the hashtags for the retweets of the Data Service — oh, and that's exactly what I seem to have done. So now we're back into Data Service type territory, because what they choose to retweet is — well, COVID-19, we'll just accept that, but things relating to data. Health 19 got in; I don't know why health 20 didn't — they stopped advertising it. Again, everything's got data in it, because that's what they do, and there are a lot of longitudinal studies in the Data Service, so that's well represented as well. Now, okay, we sort of know for the Data Service the type of things they do, but this is quite useful where you're not quite sure about the interests of a particular person or organisation whose timeline you're looking at.

A warning about the URLs. I've mentioned that here I've used hashtags — I could have done exactly the same with mentions. The little warning is about the URLs. They're provided in exactly the same format as the mentions and the hashtags, except you get a little bit more information. To start with, there's the position in the text, which is totally irrelevant really; then the URL, which is the short URL — Twitter has its own little system which puts in t.co and then a short code — and then you also get the expanded URL. So if I were to put this one into my browser and that one into my browser, I would get exactly the same thing.
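The top-ten hashtag counts shown in these graphs can be reproduced with a few lines of Python. This is a minimal sketch, assuming each tweet is v2-style JSON carrying an `entities.hashtags` list with `tag` fields; `top_hashtags` is an illustrative helper, not a library function.

```python
from collections import Counter

def top_hashtags(tweets, n=10):
    """Count hashtag usage across tweets, assuming Twitter API v2-style
    JSON where each tweet may carry entities.hashtags[].tag."""
    counts = Counter()
    for tweet in tweets:
        for h in tweet.get("entities", {}).get("hashtags", []):
            counts[h["tag"].lower()] += 1   # case-insensitive grouping
    return counts.most_common(n)
```

The same pattern works for mentions (`entities.mentions[].username`) — though, as the warning below about URLs explains, not for the short URLs.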
Okay, now the problem with this is that the short URL is different for the different people who generated it. So whereas previously I could do a count of how many COVID hashtags there were, trying to do a count of this short URL would undoubtedly find it only once, and similarly all of the other short URLs would probably appear only once each; whereas if I did a count on the expanded URL, I might get more than one, because every time a URL is shortened, the actual shortened version depends on the user and the application they used. So you can't use the shortened URL in the same way as you can use the hashtags or the mentions — you'd have to remember to use the expanded URL.

Network analysis. This is probably very popular because you can avoid all of the text stuff. It's often done for friends and followers, and friends of the followers, and so on, but that gets out of hand very, very quickly. We can do it for accounts against hashtags, which I'm going to show you in a minute. There are lots of additional software products supporting network graphs, and social networks is a subject in its own right, as you probably know. So let me show you another trivial little example. Here I've got my little spreadsheet up, and in terms of the data I've got usernames — the Data Service, BBC News and so on, so it's the Data Service and friends — and here I've got the hashtags, and I've limited this to the date range for which I collected the friends' data as well, to make it a level playing field. What I want to do is draw a little graph of username versus hashtag. In my add-ins there's a free add-in called Digraph, which I have — it's free to get if you haven't got it and you want it. It comes up and it looks like this. It's quite smart: it expects two columns, and that's all I've got, and it works out that column one is called username — yes — and column two must be the hashtag one — I say yes — and then it'll go away and create a little graph for
me. It's a directed graph — I think there may be a setting to change that, but for what we want to do it doesn't matter. Unfortunately it's very, very faint, but we can expand the graph and find sections of it; this is quite a good section if I make it a bit bigger. As I home in on the centre of the graph you can see the Data Service here, and if I go a little bit further you can sort of see all of the things: "data" is very prominent, "student", "retiring" — and this is really just over a six or seven day period, so you can see the hashtags in use over that period. If I move the graph over here you can see Covid-19, with all of the arrows pointing to Covid-19, because not only will the Data Service have mentioned Covid-19 but probably the BBC and lots of others mentioned it as well. So at this scale, and with this little Excel application, it is obviously a bit limited, but hopefully you can see the potential of what you can do with social network graphs. On a bigger system you can also change the thickness and colours of the lines, so you can use the thickness of a line, perhaps, to indicate how prominent that connection is. There are lots of things you can do with social network graphs.

Okay, finally a little bit on producers and consumers. Different people have different ideas of how they want to use Twitter, and the way I think of it — and this is entirely my own interpretation — is this. Who uses Twitter? Well, friends and family use Twitter, and the chances are they're going to talk to each other. Hobbies and pastimes — gardening or fishing or whatever — there are bound to be some Twitter accounts you can go to and get tips from, and all that sort of thing. Moving up the scale we've got the commercial and public services — this is where the Data Service fits in, but lots of local government and the like would fall into this category as well. We've got the news outlets, which, as we've seen with the BBC, are very prolific — Sky News will be the same, and all the others. And political organisations, who are very keen to put their message out, but of course they very much have highs and lows depending on when the next election is due, I suppose. So — again, this is entirely my own thinking of what the categories might be — in what direction are the tweets likely to go? Friends and family are going to produce lots of original tweets and very few retweets. Political organisations — from your point of view, you're not likely to talk to them very much unless you're particularly politically inclined, and you're not likely to retweet them; whereas they are going to make lots of original tweets, very much time-sensitive as well, just before an election, and they're going to make lots of retweets, because they will plug anything which seems to favour them. The rest of the table is just various combinations of what is likely to happen. And that's important, because it's going to affect the weighting of one type of tweet in a complete timeline as opposed to others. So if you take the example of the situation where we've been following political organisations in the US and we've got our 905 ordinary people: the timelines of those 905 are going to be very different from the political organisations and the news outlets which we've also followed, and which we're effectively going to have to combine into a single timeline for any one of our 905. If they're following a political organisation, we need that timeline; if they're following news outlets, we need those timelines as well; and we need to put them all together and then interpret what we find.

So, in summary: you've got to have a plan, especially now
where you've got to say what you want in advance. You also want a reusable plan — it could be as simple as changing the account name; as I pointed out, the steps you take are exactly the same for the original timeline as for all of the friends in that timeline. You've got to remember that the timeline is just the data, and it's a very boring data set really; you have to turn it into information, so decide what you want to do with it and make sure you've got all the right bits for that. There's much pre-built software to help with this: as I've mentioned, you get packages which will help you with the API, download the stuff for you and get around the rate limits; if you want to do sentiment analysis or whatever, there are lots of text analysis programs out there, and you can send it up to the cloud, to Azure and AWS; and there are even more social network applications, like NetworkX in Python, or Gephi, which is a free program you can put on your desktop.
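The username-versus-hashtag graph from the little Excel demo can also be rebuilt in code. This is a minimal sketch, assuming v2-style tweet JSON onto which a `username` field has already been joined (as in the reconstructed spreadsheet); the helper name is hypothetical, and the weighted edge list it returns can be loaded into NetworkX or exported for Gephi.

```python
from collections import Counter

def user_hashtag_edges(tweets):
    """Build weighted (username, hashtag) edges from v2-style tweet JSON;
    the 'username' key is assumed to have been joined on from the user data."""
    edges = Counter()
    for tweet in tweets:
        user = tweet["username"]
        for h in tweet.get("entities", {}).get("hashtags", []):
            edges[(user, h["tag"].lower())] += 1
    return edges

# The weighted edge list drops straight into NetworkX for the same picture
# the Excel add-in drew, e.g.:
#   import networkx as nx
#   g = nx.DiGraph()
#   for (user, tag), w in user_hashtag_edges(tweets).items():
#       g.add_edge(user, tag, weight=w)   # edge thickness from the weight
```

Because each edge carries a weight, a bigger tool can render line thickness from it — the "how prominent is that connection" idea mentioned above.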