Hello everyone and welcome to this UK Data Service webinar, investigating demographic representation on Twitter. I'm Marguerita Serralo and I'm a Senior Communications and Impact Officer working for the UK Data Service. Presenting today is Luke Sloan. He's a Senior Lecturer at Cardiff University and the Deputy Director of the Social Science Lab.
Hello everybody, it's a pleasure to be invited to talk about what has become my pet topic. My name is Luke Sloan, I work at the School of Social Sciences at Cardiff University, and for about three or four years now I've been working with Twitter data. When this first started, Twitter data was quite new to the social sciences, and we were working on Savage and Burrows's "coming crisis of empirical sociology" and the idea that social scientists needed to reinvigorate their methodological toolkit and think creatively about how this type of data can be used to understand social phenomena. I remember at the time the debate was always whether there's enough information in Twitter to do anything with it. From that point my stance has always been that Twitter data can be made more useful by understanding demographics, which allows us to ask more traditional social science questions. For some platforms there are demographics present and for others there aren't, and Twitter is one of those where there are signatures of demographic characteristics: of age, of gender or sex, and of social class, with occupational hints. We just needed to find a way to get them out. At Cardiff we've got very good collaboration with the computer scientists as well, and together we found a few solutions. So what I want to talk about today is some of the solutions for estimating demographic characteristics, inferred if you like, from actual Twitter data rather than from a survey.
They're not perfect, but some of the evidence I'll show suggests that we are classifying people correctly, and there's been some more recent work trying to cross-reference these with the British Social Attitudes Survey 2015, and I'll talk a bit about that as well. My fundamental stance comes from having done lots of work on Twitter, including trying to predict the general election. If you want a story of the methodological woes of using Twitter to predict elections, I would highly recommend that paper, which is basically full of us saying we tried but it didn't work, and here's why. I've also looked at lots of other things: the Welsh Assembly election, the horsemeat scandal, and crime sensing through social media, taking mentions of crime and disorder terms and linking them to the prevalence of crime, and so on. All of these things can be enhanced by understanding demographics. Voting behaviour, and intention to vote in relation to whether people actually vote, is determined by demographic factors. Being a victim of certain crimes is determined by that as well. The horsemeat scandal affected different demographic groups in various ways. For example, the person who normally does the shopping in the household, who is typically female, may have different concerns about horsemeat to a male tweeter. It may or may not matter. In addition, age, whether you're of the age where you may have a young family, or whether you're retired, or whether you're young, would determine your reaction to food scares as well. All of these things benefit from knowing about some of the demographic characteristics.
Social science is primarily interested in group differences and how they affect social behaviour. On that basis we want to be able to use Twitter data to explain social phenomena and to understand how they manifest in the virtual world. But there is nothing systematic.
Twitter has signatures of various demographic characteristics and various behavioural characteristics, which are mediated through the technology and the platform. One thing I'll say at the start, and I want to say it very clearly now in case I don't make the point enough during the presentation, is that the nature of the technology and platform, in this case Twitter, has a huge mediating impact on how demographic signatures are picked up and how identity is expressed. So with all of this we're looking at how the real world is manifest in the virtual, but there's a lot in between about identity play, about deception, and sometimes just about people genuinely making a typo. If someone says in their profile that they're 300 years old and they meant to put 30, that's not always deception; sometimes it's a genuine mistake.
So overall, the general research questions that drive my interest are these. What insights do demographic proxies give us into behaviour on Twitter? To test that, if we think we can identify the demographic characteristics of individual users, then we would expect, based on that classification, to see differing patterns of behaviour. I'll show you an example of sentiment scores during the Olympics, broken down by people we think are male and female, which suggests that the categories we've identified are very real. That then leads on to whether real-world demographic differences manifest in the virtual world. Do issues around social inequality based on social class manifest online? Do people in the higher NS-SEC groups, "higher" being a relative term, so the higher social classes, have better networks? Or is Twitter actually a democratising medium that allows people who are typically disenfranchised to create a network of their own? Does it distort traditional understandings of networking and connectivity, or does it just reproduce them in a virtual environment?
So there are lots of questions, and lots of things we can do, if we can get hold of these demographic characteristics. I'm going to talk about gender, a bit about location, a bit about age and a bit about occupation.
One of the prime demographic characteristics of interest to social scientists is whether someone's male or female. I will use the term gender here, because we're looking for virtual representations as opposed to sex, and the two are quite different in that sense. The way we went about this, very simply, is by looking at the first name of a user on Twitter. Some people know this and some people don't, so I'll pause and, for some of you, tell you something you already know, but it's quite critical. When you access Twitter data, from either a live feed or by buying historical data, you don't just get the tweet. You get a whole load of information, such as the profile description field, how many people the user is following, links to profile images, and so on and so forth. It's called a JSON file, normally that's the format it comes in, and there is a lot more information in there than the tweet itself. It's actually in this metadata that a lot of the information that's of interest to me exists. Things like whether someone has geotagging switched on. I'll cover this a bit later, but if someone switches on geotagging we know the exact point, the latitude and longitude to the metre, of where they were when they made that tweet, assuming it's from a mobile device. So when we think about how useful Twitter data is, its utility is linked not just to the content of the tweet but to everything else that comes with it. The metadata is increasingly becoming mainstream data. One of the things we get is the name of the person, and quite simply we can try to, well, not predict but categorise someone's gender as male, female, unisex or unknown based on that name.
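To make the point about metadata concrete, here is a minimal sketch of pulling a few of the fields mentioned above out of a single tweet's JSON. The field names (`user.description`, `user.followers_count`, `coordinates` and so on) follow the classic v1.1 tweet layout as I understand it and should be treated as an assumption; the sample record is entirely invented for illustration.

```python
import json

# Invented sample record; real payloads carry many more fields.
SAMPLE_TWEET = """
{
  "created_at": "Sat Aug 04 20:46:00 +0000 2012",
  "text": "What a night in the stadium!",
  "lang": "en",
  "coordinates": {"type": "Point", "coordinates": [-0.0166, 51.5386]},
  "user": {
    "name": "Jessica Example",
    "description": "Teacher, 30 years old, runner",
    "location": "London",
    "followers_count": 120,
    "friends_count": 80,
    "lang": "en"
  }
}
"""


def extract_metadata(raw_json: str) -> dict:
    """Return the tweet text plus the profile metadata discussed in the talk."""
    tweet = json.loads(raw_json)
    user = tweet.get("user", {})
    return {
        "text": tweet.get("text"),
        "name": user.get("name"),
        "description": user.get("description"),
        "profile_location": user.get("location"),
        "followers": user.get("followers_count"),
        "following": user.get("friends_count"),
        "interface_lang": user.get("lang"),   # language of the user interface
        "tweet_lang": tweet.get("lang"),      # language of the tweet itself
        "geotagged": tweet.get("coordinates") is not None,
    }


if __name__ == "__main__":
    for key, value in extract_metadata(SAMPLE_TWEET).items():
        print(f"{key}: {value}")
```

Using `.get` throughout means a record with missing fields yields `None` values rather than raising, which matters when most tweets carry no geotag.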
You can look at the paper referenced there if you're interested in this. There's a database of 40,000 names, and it records whether each is male, female or unisex. Essentially, if you do some cleaning up of the text and teach it some simple rules, and this is why it's so important to work with computer scientists and people skilled in the technology and the tool building for doing this, then you start to classify people. Of those we could identify as being male or female, the split on Twitter for the UK population, as far as we can identify them, is 48.8% male and 51.2% female. That is very close to the general prevalence of male and female in the population according to the 2011 census, of 49.1% for men and 50.9% for women.
Now what I will say is that in the recent work I've done looking at the British Social Attitudes 2015 survey, we had a question on Twitter use. That is a random probability sample survey, with weighting to make it nationally representative, and it's starting to hint that there is actually a disproportionate number of men on Twitter, so about 52% male. That could mean a few things. We can be reasonably confident of that estimate, so it suggests that men may be more likely to use Twitter than women in general. Or it might mean that there are more male names we can identify, and we can then think about what that means. It could involve identity play. It could mean that female tweeters are less likely to put their name in their profile. What we're hinting at here is some difference in behaviour, some idea that gender is an important factor in understanding Twitter use, even though the difference from the population at large is quite small. As with a lot of the demographics I identify, I think they generate more questions than answers, and I would really encourage people to go away and try to answer those questions. I'd love to know why, for example, we might find more men on Twitter than women.
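The name-matching step described above can be sketched roughly as follows. This is not the method from the paper: the five-entry `NAME_GENDER` table is a hypothetical stand-in for the 40,000-name database, and the cleaning rules (strip non-letters, skip a few titles, take the first remaining token) are my own simplifying assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in for the 40,000-name database mentioned in the talk.
NAME_GENDER = {
    "luke": "male",
    "mohamed": "male",
    "jessica": "female",
    "sarah": "female",
    "alex": "unisex",
}

# Assumed cleaning rule: ignore common titles before the first name.
TITLES = {"dr", "mr", "mrs", "ms", "miss", "prof"}


def classify_gender(display_name: str) -> str:
    """Return 'male', 'female', 'unisex' or 'unknown' for a profile name."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", display_name)]
    tokens = [t for t in tokens if t not in TITLES]
    if not tokens:
        return "unknown"
    return NAME_GENDER.get(tokens[0], "unknown")


def gender_split(names) -> dict:
    """Share of male and female among users identified as one or the other."""
    counts = Counter(classify_gender(n) for n in names)
    identified = counts["male"] + counts["female"]
    if identified == 0:
        return {}
    return {
        "male": counts["male"] / identified,
        "female": counts["female"] / identified,
    }


if __name__ == "__main__":
    sample = ["Luke Sloan", "~*Jessica*~", "Dr Sarah", "Alex", "12345"]
    print([classify_gender(n) for n in sample])
    print(gender_split(sample))
```

Unisex and unknown users are left out of the denominator in `gender_split`, mirroring the way the split above is reported only for those who could be identified as male or female.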
That would be really interesting to know from a behavioural perspective, and is perhaps counter to what some colleagues would expect.
Now the next question, obviously, is one a lot of people ask: people engage in identity play, you're matching names, looking for signatures, using algorithms, so how do you know that you're picking up a real difference, that you're categorising people correctly? Well, we can only really test that by looking at some real-world event which is manifested somehow on Twitter, splitting people by male and female, and seeing if their behaviour differs. We tried this for the London Olympics. What you can see now is a graph. I apologise for the pink and blue, but I present this so often that I need a heuristic to help people understand what's going on. The jagged coloured lines at the top of the graph are sentiment scores. We've taken the average sentiment for every minute and then split it between male and female tweeters, for those we could identify, so those we couldn't identify, or who are unisex, are excluded from this graph.
This is actually from Super Saturday, when we won three gold medals, and what you can see is three peaks in sentiment on the pink line at the top, which correspond to real-world events. When you look at the timeline, and this is another way of checking that there's a link between the real and the virtual world, the first peak is when Farah starts his race, the second peak is when he moves to third place, and the final peak is when Jessica Ennis hadn't yet won her medal but had got to the point where she couldn't be beaten. Now I'm just going to take a step back, and you'll notice that these three peaks are all pink. This suggests that during this event the users we can identify as female used more positive language than their male counterparts, almost consistently throughout actually, and very noticeably so. There's enough difference between those two groups to be confident that we are classifying a real
difference. For me, if we were classifying people randomly and not categorising them correctly, there would be no discernible pattern. So for me that's evidence that there is something in there. It needs refinement. There are issues around what sentiment analysis is really doing, and I've had conversations with linguists who hate it, and I kind of think they have a point. But when you're dealing with hundreds of thousands of data points and trying to process them in real time, this is an admittedly quick and dirty way of getting some data. And again, the fact that we can see real differences suggests that it is picking up on something, particularly when it's tied to real-world events.
OK, so sentiment peaks at real-world events, and that's an interesting finding. Sentiment differs between men and women. The one thing I should really point out is that this is likely event-specific. You might not find the difference at another event. At the Rio Olympics you might find that men peak over women, or whatever. The point with Twitter data is that findings are very specific to a particular context, to a particular happening, and we always have to bear that in mind when we're trying to generalise. We can't say that women are always more positive in their language than men without doing something much more comprehensive and all-encompassing.
OK, now a few notes on location. There are at least three sources of location in Twitter data that we can use, and I'm going to move through them in order of utility. Just as with quantitative data we have interval, ordinal and nominal, and interval has more utility and more information than ordinal, and ordinal more than nominal, we have almost a hierarchy of data available to us. So for example we have geotagged tweets. These are tweets made by individuals where, when they are made, the exact location of the user is recorded. This accounts for anywhere between one and two percent of the data. It depends on
the event. If you're at, again, the Olympic stadium you might be more likely to have geotagging turned on than if you're just tweeting generally while commuting to work. The rates do differ depending on the context.
This data is incredibly useful, because geography provides a key for us to link tweets, which are generally lacking in context, with the context of the geographical area that they're in. For example, if someone tweets that they don't feel safe and they have geotagging switched on, we can look at the area they were in when they tweeted that. We can look at the crime statistics and deprivation rates. We can look at all manner of things: population density, whether people voted in or out in the Brexit referendum, perhaps attitudes in the local authority towards migration, and so on and so forth. Because we have point geotagged data, we can locate someone in the lowest possible geography, so output area, for those of you familiar with it. One of the scary things about this is that if someone has geotagging switched on and they tweet 20 times a day, you can pretty much follow their journey through a city or any geographical area and see where they are. So there are issues around consent and informed consent here, which we're dealing with as an academic community, and if anybody's interested I can direct them to lots of different resources on whether users actually understand the type of data they're producing or not.
Fab. So here we have a map, which is just some geotagged data, to show you probably what you'd expect, which loosely corresponds with Twitter use: Europe, North America, South America. It's hard to tell on this map, but I will note that there is a lack of density in China, and they have alternative platforms there, like Weibo. So Twitter use is not universal. It's maybe particularly appropriate for studies of Western countries, so certainly North America is fine, Europe is generally fine, the UK certainly, and so forth.
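As a rough illustration of working with the geotagged tweets described above, the sketch below pulls the point location out of each record and computes what share of a collection is geotagged. It assumes the v1.1-style `coordinates` field, which holds a GeoJSON point in longitude, latitude order; the three sample records are invented. Mapping a point onto an output area would additionally need boundary data, which is not shown here.

```python
from typing import Iterable, Optional, Tuple


def geotag_point(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for a geotagged tweet, else None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    # GeoJSON stores positions as [longitude, latitude]; swap to (lat, lon).
    lon, lat = coords["coordinates"]
    return (lat, lon)


def geotag_rate(tweets: Iterable[dict]) -> float:
    """Proportion of tweets in the collection that carry a point location."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    tagged = sum(1 for t in tweets if geotag_point(t) is not None)
    return tagged / len(tweets)


if __name__ == "__main__":
    # Invented records: one geotagged, two not.
    sample = [
        {"text": "By the station", "coordinates": {"type": "Point", "coordinates": [-3.18, 51.48]}},
        {"text": "Morning all", "coordinates": None},
        {"text": "No geo field at all"},
    ]
    print(geotag_point(sample[0]))
    print(f"{geotag_rate(sample):.1%} of sample tweets are geotagged")
```

Computing the rate per event or per user group, rather than once for the whole collection, is one way to see the context dependence mentioned above.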
The dots you can see elsewhere that aren't on land, I am reliably informed, tend to correspond to shipping routes, or to people tweeting on flights who probably shouldn't be. So there's some interesting discussion there about when people tweet and whether they should, but it's interesting to see.
Before I go on, the slide I've just moved on to covers the other two types of geography. In the metadata you get a profile field where people sometimes give a location, so they'll say Manchester, London, Cardiff or whatever. That data is problematic because you don't know if that's where someone's from, if it's where they work, if it's the area they identify with, or where they actually are. That's the first problem. They might even be abroad on holiday, or at a conference, tweeting something, and their data says they should be in Manchester. The other problem is that Manchester, London and Cardiff are not really useful geographical units, because they're massive urban areas. If someone says they're from Truro in Cornwall, for instance, maybe that's more useful as it's a smaller area, but if someone says they're in London you can't locate them within a parliamentary constituency, for instance, if they're talking about vote intention and so forth. So that's the next level.
The lowest level, and it's more for people interested in linguistics and natural language processing, is mundane references to geography, so "just up the road from me" or "close to where I live". Those things are usually only relative to the reference point, i.e.
where they are at the time of tweeting. But there is potential there, with, you know, "just by the train station", in trying to work out if that data can be meaningful or useful in any way.
So obviously the most useful data, based on what I've said, is the geotagged data. The issue is that the people who switch geotagging on are not a random sample of the Twitter population. The Twitter population is not a random sample of the UK population either, but there is a tendency in studies to use geotagged data because it's the most useful. We need to be aware that male users are disproportionately more likely to use geotagging, and geotaggers do tend to be older users. We'll talk a bit about the age distribution of Twitter users on the next slide. Occupational group has an impact, and geotaggers also have different user interface languages. One of the pieces of information you get is the language of the interface. There's something like 20 or maybe even 40 different language interfaces, and there's also of course the language of the tweet, which is something different again. If you do that, and you can look at the papers for how we've identified this, you see differences in rates. Often the differences between one category and the next are quite small, but because we have a lot of data points they are significant, and we're pretty confident that the differences are real even if they're small. There's an interesting discussion to be had on the side about what statistical significance means at all when you've got so many data points. It's effect sizes, actually, which are more useful, but that's maybe for another time.
OK, so that's location. What I'm presenting to you here is the age distribution of Twitter users, for those we can identify, which is the bottom part of the graph, and the top part of the graph is the age distribution according to the census. What you can see very simply is a clear peak of youthfulness in Twitter users, which is not surprising. There are a few things to take from this, though. If you look at the number
of users over 30, it's a small proportion, but there are still millions and millions of users, so a small proportion of a very big number is still a big number. There are still tens, hundreds of thousands of users over the age of 50 in the UK, we think, who use Twitter, so that's still a big sample. The other thing that I've heard people talk about, and I need to replicate this study, maybe now or in a few years' time, to see if it's happening, is that perhaps younger people are no longer signing on to Twitter, and what we'll actually see is that the peak in age will shift to the right. It'll just continue rather than peter out. So it's not that there is a young cohort of people who are constantly signing up. There is actually a generational effect, and that generation may stay on Twitter. That's a hypothesis I've heard a few people voice, and it'll be interesting to find out.
The other thing I should point out is that Twitter's terms of service mean you have to be 13 to use the platform, hence the cut-off at 13. It's entirely possible there are a lot of people who are younger than 13 but had to enter a date of birth that shows them as 13 to be on here, so there's opportunity there for identity play as well. The way we identify age is that, very simply, people write their age on Twitter. They'll say "X years old", and with some simple pattern matching you can pick that up. It works for about 0.35% of the Twitter users we could identify as being in the UK: a very small proportion, again, of a large number. There is some evidence to suggest that there is perhaps a preponderance of younger users reporting their age, and that older users are more reticent to do so. There's a little bit of evidence appearing around that. So again this is about social desirability, about behaviour online, about the creation of a virtual identity, and this may not be representing what is actually going on. It's only those we can identify, so it's work in progress in that sense.
The final one is class and occupation. So what
you can see in the graph is that the lighter grey bars are what we think the profile by class looks like, excluding students and not-classified users, and the darker grey is the proportion in each class based on census data. This is difficult. The way we do it is by looking for occupational terms that match the SOC codes provided by the ONS, and then we map them on to a simplified social class, so there are a lot of assumptions and steps there. Without going into too much detail (the technical details are in the paper, including an evaluation of where we think we got it right and where we think we got it wrong), it's generally quite easy to identify Twitter users in professions: teacher, nurse, doctor, perhaps also journalists and so on. Perhaps those are also the people who are more likely to be using Twitter to present a professional identity. The big issue, and the one that's hardest to get round, is when people mention a hobby such as photographer or writer on their profile. It's not their occupation, but a lot of those people can end up in group 2, so we have an overrepresentation of people there, as we're confusing hobbies with occupations. There's quite a bit of discussion in that paper about where we can probably be reasonably certain that our estimates are correct, based on the types of jobs and occupations that exist within particular groups.
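Both the age and the occupation estimates described above rest on simple pattern matching over the profile description, and a minimal sketch of that idea follows. The regular expression, the plausible age range of 13 to 100, and the short term lists are all my own assumptions for illustration, not the rules used in the papers. The further step of mapping a matched term to a SOC code and then a social class needs the ONS lookup tables and is deliberately left out.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: "30 years old", "30 yrs old", "30yo", "30 y/o".
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})\s*(?:years?\s*old|yrs?\s*old|y/?o)\b", re.IGNORECASE
)

# Illustrative term lists only; a real run would use the full SOC vocabulary.
OCCUPATION_TERMS = {"teacher", "nurse", "doctor", "journalist", "photographer", "writer"}
# Terms the talk flags as often being hobbies rather than occupations.
HOBBY_PRONE = {"photographer", "writer"}


def extract_age(description: str, low: int = 13, high: int = 100) -> Optional[int]:
    """Return a self-reported age, or None if absent or implausible."""
    match = AGE_PATTERN.search(description)
    if match is None:
        return None
    age = int(match.group(1))
    # Drop typos and jokes such as "300 years old".
    return age if low <= age <= high else None


def find_occupation(description: str) -> Optional[Tuple[str, bool]]:
    """Return (term, may_be_hobby) for the first occupational term found."""
    for word in re.findall(r"[a-z]+", description.lower()):
        if word in OCCUPATION_TERMS:
            return word, word in HOBBY_PRONE
    return None


if __name__ == "__main__":
    for text in ["Teacher, 30 years old, runner", "Amateur photographer. 21yo", "300 years old"]:
        print(text, "->", extract_age(text), find_occupation(text))
```

Flagging hobby-prone terms rather than dropping them keeps the decision about how to treat photographers and writers visible, which is where the overrepresentation problem arises.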