So, here's an overview of what we're going to cover in this webinar. You're going to have a very brief introduction to the Big Data Network Phase 2 and the Urban Big Data Centre, and then Mark and Shiva are going to talk about the integrated multimedia city data that they've been collecting. This consists of a number of data strands: a survey, GPS and life-logging devices, image analysis, textual media data and multimedia data. At the end there's going to be time for questions. Just to say that all attendees are muted, so we can't hear you. If you do have a question, you can type it in the question box at any point in the webinar. If you can't see what box I'm talking about, look to the top right of your screen and you should see a red arrow; if you click on that arrow you should be able to see where to type your questions in. There's going to be no time to answer questions during the talk itself, which is going to be about 30 minutes long, but there'll be plenty of time for questions at the end.
Okay, so this is the third of three research webinars given by researchers from the Big Data Network Phase 2. So what is the Big Data Network Phase 2? The ESRC has invested in three business and local government data research centres: the BLG, which is based at the University of Essex; the CDRC, the Consumer Data Research Centre, at Leeds and UCL; and the Urban Big Data Centre. Our speakers today are from the UBDC. The three research centres aim to make data that's routinely collected by business and local government organisations accessible for research purposes, to benefit the data owners and society as well as researchers, and in such a way that individuals' identities are safeguarded. And Big Data Network Support, at the UK Data Service, is funded by the ESRC to support and coordinate activities between these three centres.
So our aim is to unify data discovery across the three data collections, to encourage sharing of information and expertise across the centres, and to coordinate user training and capacity building for researchers using their data. Okay, so that's all from me. I'm now going to hand over to Mark, who's going to start the main part of the presentation.
Okay, thank you Sarah. So hopefully you can all see my screen now. Just to start with, let me introduce my co-presenter, Siwa Naviska, who's a research fellow here.
Hello.
Okay, so I'm going to batter on because we've got quite a lot to cover. What I want to do is say a little bit, very briefly, about the Urban Big Data Centre. To some extent some of that's already been said by Sarah, but essentially the aim of our centre, which is funded for a five-year period, is to bring together and link data from a wide variety of sources. Local government and business data particularly is the aim, although we've got wider sets of data than that. The idea of bringing these data together and linking them is that we provide new and innovative data sets for a wide range of people: not just academics but also policy makers and, as much as possible, the general public as well. But we're not only bringing these data together and creating data sets for other people to use. We do a number of other things: we do a lot of education work around big data and working with big data, and we have a remit in that way. We also hope to be at the cutting edge of new methodologies, and to facilitate others in those areas.
Okay, so let me talk a little bit about the Integrated Multimedia City Data project, as it's called, or IMCD as I'll refer to it from now on, or, as a colleague of ours calls it, the IMACD project.
This is a project that was funded by the ESRC as a way of kickstarting the data held by the Urban Big Data Centre. It's always been run slightly separately from the Urban Big Data Centre, and I'll come on and tell you a little bit about that in a second. It's a collaboration across the university, and not just our university but other universities too. There are colleagues in Newcastle and Sheffield who work with us, but within the university we've got people working in urban studies, in transport in particular. We have two teams in computer science who work with us, the information retrieval teams, and then we have a team from education who have a specific learning remit. So we've collaborated quite widely across the university. The Urban Big Data Centre has wider collaborations than that and includes other people as well.
Okay, so let's get to the meat of the IMCD. The main thing I would say is that there are five main strands of IMCD data, and they tend to hang around the household survey. The core part of this was a random representative survey of people in the wider Glasgow area, and by that I mean the urban metropolitan area of Glasgow. It's commonly known as the Glasgow Clyde Valley Planning Area. As I say, it's a random representative household survey. It includes all the adults in the household, and it's got a unique combination of modules, but really the uniqueness of the project is more in the combination of these five data strands. As part of this project, we've collected data from a subsample of the household survey respondents, who carried GPS sensors and life-logging sensors. I won't go into that too much today because Shiva is going to talk about that in more detail. Just to say that life-loggers are wearable cameras, really, although they collect wider information than that.
The other main parts of the project are these. We have a textual media data retrieval project, which essentially crawls the web, downloading data about Glasgow from a number of websites. We're talking over a hundred websites: social media websites like Twitter and Foursquare, but also news websites such as the BBC, and newspapers. We also collect data on weather and a number of other things. From similar websites we also collect multimedia data, mostly images, again downloading and collecting information about Glasgow. And finally, a little bit tangentially, we have also bought a mixture of satellite data and LiDAR data. For the satellite data, we bought stereo pairs to allow us to build 3D models that can be used for other research purposes. For those who don't know, LiDAR data is essentially a kind of 3D data. It's created by airplanes passing back and forth across an area, bouncing lasers off the ground and building up a very, very accurate 3D pattern of the city.
Okay, so I'm going to move on and talk a little bit about the survey, and then I'll hand over to Shiva to talk about the other parts of the project. As I said, this is a household survey. We wanted 1,500 households, and from those 1,500 households we expected to get somewhere between about 1,700 and 3,000 people. That's a range of 60 to 100 percent of the adults in those households; we weren't expecting to get 100 percent, probably closer to the 60 percent, if we're honest. The coverage, as I have said before, is the Glasgow Clyde Valley planning area. The survey has main modules on travel and transport. We also collect data on sustainability, so on green behavior and green attitudes. We also collect information about ICT use, and we have a large module on education and learning.
And one of the unusual things about the education and learning part is that we collect information about knowledge. With every single part of the survey, we tried to collect data on attitudes to that area, on behaviors in that area, and also on knowledge. So we can look at what people's reported attitudes were, what their behaviors were and how those compare to their attitudes, and also at their knowledge. We asked a number of very innovative questions that look at people's understanding of the area.
Okay. So that was the core of the survey, but quite a large number of the questions are more typical survey questions that you get in surveys like the Scottish Household Survey and a number of national surveys. We had large demographic sections. We asked questions about people's neighborhoods and their communities. We asked some questions about where they lived before they moved to the area they're in now. We had a small module of health questions. We asked about people's employment and previous employment. We have quite a significant amount on income. And we have a small section about people's cultural and civic engagement.
Just to give you an idea, these are two of the more unusual questions that we have. This first one is an example of one of the knowledge questions, one of the questions designed to try and get at understanding. Essentially it's a calculation: people are asked a question and then asked to give an answer, but it requires them actually to work out the answer. Another example of a rather unusual question was "Does your household have any pet dogs?" The reason for asking this is that we were asked whether we could put it in the survey, because a number of researchers in public health think that ownership of a pet dog might be a contributor to better health. So we included this in our questions.
Just very briefly to finish, the survey was in the field from the 15th of April to the 21st of November 2015. We actually achieved 1,508 households. 51% of the people who were approached accepted, and 30% of the sampled households refused to take part. In the households that agreed to take part, we got 2,095 people who took part (excuse the spelling mistake), so 74% of eligible adults. That's about 1.4 adults per household compared to the 1.9 adults per household in the census. The data is at local authority level and should be available quite widely, so you should be able to get that data fairly soon. Okay, and now I'm going to hand over to Shiva.
So hello. One of the strands of the IMCD project was the sensor survey. In the sensor survey, as Mark said, we asked people to carry tiny GPS devices and wearable cameras. The GPS device collects information about your location every five seconds, and the wearable camera takes photos every five seconds, but not only photos. It can also record how fast you accelerate, it can detect light and brightness, and a few other things. Before we started the sensor survey, we first had to choose a proper device. We wanted to make it really easy for people; we didn't want them to spend much time on charging the device and learning how to deal with it. Therefore, we chose a device that was not only cheap for us, but also the best for the potential users to use. Also with the sensor project, apart from the GPS device and the life logger, we asked people for some ground truth data, because we needed something to verify what sort of data was collected via the GPS device and the life logger. We asked people to fill in an activity diary where they told us what they did on the first day of their survey.
So to be able to proceed with the data collection, we first needed to design systems and hire researchers who would deliver trackers to all the people who agreed to take part. We had a system in which we were able to manage when and where people were getting the devices. And once we collected the devices from people, we could put all the data onto our life log memory server, where we could visualize the data and see what we have there.
So, some statistics. From the 2,095 people who took part in the survey, we got 19%, so almost 20% of people, who decided to carry either one or two devices for us. And we thought it was a fantastic achievement compared with previous studies. Here you can see the differences between male and female, and between people who were carrying both devices and one. A very important thing is also that the average age of respondents for the sensor survey was a little bit lower than the average age for the household survey. So, as we might think, younger people are more into technology and were more willing to help us with it.
Having the data from GPS trajectories and life loggers, we were able to combine it, so not only to see where the individuals were going, but also what these people saw. To be able to enrich this data with some more information, we first had to process the GPS data. We had to clean it, because there are some crazy outliers: if you lose the GPS signal, one point is in Glasgow and the next one will be in Iceland. Then we had to segment trajectories, so divide them into homogeneous segments of movement and non-movement. From this point, we identified stops, so we knew where the person stopped. Knowing where the person stopped, we could add additional information to this. We could try to identify where the person's home is, their work, the dentist, etc.
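The three processing steps just described (cleaning outliers, splitting trajectories into movement and non-movement, and picking out stops) can be sketched in a few lines of Python. This is only an illustration, not the pipeline the project actually used: the `Fix` record, the function names and all thresholds (70 m/s as an implausible speed, 0.5 m/s as the movement cut-off, 120 seconds as a minimum stop) are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Fix:
    """One GPS reading (hypothetical record layout)."""
    t: float    # seconds since the start of the recording
    lat: float  # degrees
    lon: float  # degrees


def haversine_m(a, b):
    """Great-circle distance between two fixes, in metres."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dphi = p2 - p1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def speed_mps(a, b):
    """Average speed needed to get from fix a to fix b, in metres per second."""
    dt = b.t - a.t
    return haversine_m(a, b) / dt if dt > 0 else float("inf")


def drop_outliers(fixes, max_speed=70.0):
    """Keep a fix only if it is reachable from the last kept fix at a plausible speed."""
    kept = []
    for f in fixes:
        if not kept or speed_mps(kept[-1], f) <= max_speed:
            kept.append(f)
    return kept


def segment(fixes, move_speed=0.5):
    """Split a cleaned track into consecutive runs of (is_moving, [fixes])."""
    segments = []
    for prev, cur in zip(fixes, fixes[1:]):
        moving = speed_mps(prev, cur) > move_speed
        if segments and segments[-1][0] == moving:
            segments[-1][1].append(cur)
        else:
            segments.append((moving, [prev, cur]))
    return segments


def stops(segments, min_duration=120.0):
    """Return (mean_lat, mean_lon, duration_s) for each long enough stationary run."""
    found = []
    for moving, pts in segments:
        duration = pts[-1].t - pts[0].t
        if not moving and duration >= min_duration:
            lat = sum(p.lat for p in pts) / len(pts)
            lon = sum(p.lon for p in pts) / len(pts)
            found.append((lat, lon, duration))
    return found
```

Calling `stops(segment(drop_outliers(track)))` on a list of fixes returns candidate stop locations, which is the point where labels such as home or work could be attached. The sketch assumes the first fix of a track is trustworthy; a real cleaning step would need to handle a bad first reading and signal noise while the device is stationary.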
So having all this information and classified data, as you can see on the map here, we covered quite a lot of Glasgow with the detailed movement patterns of all our participants. And then imagine that from all these people, so it was 402 people who carried a GPS device and 265 of them also carried a life logger, we get an opportunity to create not only something like Google Street View, where we can see what's going on on the streets, but also something like a street view inside your house, inside your office. So we can analyze what was going on constantly and everywhere, not only on the street.
As you could see, the data that you get from GPS could be quite intrusive, because it could easily give us information about all your locations. So I was wondering whether you know what techniques could be used to protect privacy with GPS. And I would like to show you now an animation that Glasgow School of Art prepared for us, showing the movement of men and women in Glasgow within one week. Sorry for the colors: it's blue for men and pink for women. And to answer what you said: you can blur the data, you can obscure start and end points, restrict access to the data, or leave the street names off the map. We can do all of it. In our data set, what we did first was obscure the start and end points on the map, but we can also blur the data. There are various methods of analyzing trajectories while protecting individuals' privacy. And in the data set you can see here, we can distinguish different patterns of women and men during the week; you can see that on Sunday women spend quite a lot of time shopping in the city center.
So the next strand of the IMCD is the textual media data and multimedia data retrieval.
These slides we got from our colleagues in the Terrier Team, so thank you to them for these, and I will try to tell you as much as possible about it, even though I haven't done this work myself. I won't be talking about all the data they collected, but I will focus on three main types, and will show you later the analyses that are possible with this data. So we collected Twitter data: more than 65 million tweets between July 2014 and November 2015. Within this number, we also collected some geolocated tweets, tweets created by certain users like BBC, West Scotland, Police Scotland, Glasgow City Council, etc., and also tweets on certain topics, so certain hashtags like Glasgow, Glasgow 2014, Glasgow 2015, Glasgow, Buchanan Street, etc. Here you can see the tweets in Glasgow taken from Eric Fischer's map, so these are all geotagged tweets in Glasgow, and I was wondering whether you know what percentage of tweets in our sample are geotagged? There are a few people who had it right. Quite many of you said that it's 10%, but we had less than 1%. Less than 1% of all the tweets were geotagged, but from 65 million, it's still quite a lot to analyze.
Another small strand of data within the Terrier Team project was the weather data. We are getting information on Glasgow weather: hourly information about temperature, humidity, precipitation, wind direction, etc., and this data covers a similar time period to the Twitter data and the survey. The next strand was connected with transportation data. We got some static data about cycle routes from Glasgow City Council; we also have railway stop names, locations, and identifiers from the NaPTAN database; and, to make it more real-time, we have a daily data feed on timings of passenger trains, so it's a real-time feed about train delays.
So the Terrier Team, to allow people to analyze these different data sets, which are sometimes difficult to get individually, created something called the Urban Data Dashboard. This data dashboard is restricted by IP address, so you have to apply to UBDC to get access, and you have to do it for a particular research project, so that it can be validated whether you can get access or not. What they did was create a few data services inside. There's social media, where you can analyze different terms and different tweets and try to detect different events on Twitter. There is also a part that allows you to analyze transportation and weather.
I'll focus quickly on the transportation data. We can get information about train delays on different days. You can also visualize your own GPS trace; it can also be classified, so you know where the person was walking, driving, taking the bus, etc. So you can analyze all this data, and then we can try to combine the information from train delays and all the transportation analysis with Twitter data to find out what people thought about all these things. From the data we've collected, we know that there was a huge peak in delays on the 1st of July. Then we analyzed Twitter for unhappiness about public transport and commuting, and we got the same peak on the 1st. So we thought, hmm, there must be some weird correlation in it. Do I have the weather? Sorry. Oh, I think I lost one slide, but there was a third one showing the weather in Glasgow, and on the same day, the 1st of July, it was the only warm day in Glasgow. It was 26 degrees, so people were sharing their comments that it's hot, that people can't stand standing in the trains, that ScotRail can't cope with the weather, etc., etc.
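The kind of cross-checking described here, where a peak in train delays lines up with a peak in unhappy tweets and a peak in temperature, could be sketched roughly as follows. This is not how the dashboard works internally; it is a minimal illustration that flags days standing well above each series' own average and keeps the days flagged in every series. The function names, the threshold of two standard deviations, and all the numbers in the toy data are invented for the example and are not taken from the project.

```python
from statistics import mean, pstdev


def peak_days(series, z=2.0):
    """Days whose value lies more than z population standard deviations above the mean."""
    values = list(series.values())
    if len(values) < 2:
        return set()
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()  # a flat series has no peaks
    return {day for day, v in series.items() if (v - mu) / sigma > z}


def shared_peaks(*series, z=2.0):
    """Days that are peaks in every series, compared over the days they all cover."""
    common = set.intersection(*(set(s) for s in series))
    aligned = [{d: s[d] for d in common} for s in series]
    return sorted(set.intersection(*(peak_days(s, z) for s in aligned)))


# Toy daily values (invented): ten days labelled month-day.
days = ["06-26", "06-27", "06-28", "06-29", "06-30",
        "07-01", "07-02", "07-03", "07-04", "07-05"]
delay_minutes = dict(zip(days, [5, 6, 4, 5, 6, 30, 5, 4, 6, 5]))
complaint_tweets = dict(zip(days, [12, 15, 10, 14, 11, 90, 13, 12, 16, 14]))
max_temp_c = dict(zip(days, [17, 18, 16, 17, 19, 26, 18, 17, 16, 18]))

print(shared_peaks(delay_minutes, complaint_tweets, max_temp_c))  # ['07-01'] with these toy numbers
```

A shared peak like this only says that the three series spike on the same day; reading the tweets themselves, as described above, is what suggests why.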
So what I wanted to show via this slide is that you can do proper analysis and proper investigation by combining these different data strands. And I think that's it. I hope you enjoyed it. Thank you very much for your attention.