Hello everyone and welcome to this webinar, Bringing New Forms of Data to the Study of Cities. I'm Margarita Ceraulo, an Outreachers Officer working for the UK Data Service, and presenting today is Dani Arribas-Bel, a lecturer in Geographic Data Science at the Consumer Data Research Centre, based at the University of Liverpool.

Welcome everyone, thank you very, very much for tuning in today. My name is Dani Arribas-Bel and I work at the University of Liverpool in the Department of Geography and Planning. I'm also a member of something called the Geographic Data Science Lab, which is a research unit here in geography. So today, basically, I want to give you an overview of some of the main changes that I think have happened in the last few years around data and cities, and really the intersection of those two. The talk is pretty much in two parts. In the first half I'm going to talk very broadly about big-picture changes that have swept this data landscape for cities and urban research. And then in the second one I'll switch gears and show you a couple of examples of my own research. This is not because I consider them to be the best examples ever, but they are projects that I was involved in, so in case there are any questions afterwards, I feel much more comfortable speaking about my own projects than other people's.

So before that, let me tell you a little bit about what I call the geodata revolution, or the data revolution. The best way I could find to put that succinctly and clearly was in a sentence that kind of builds on itself. It is this idea that over the last decade there's been an explosion of data available to researchers, and that's probably hard to argue against. A lot of this data comes in new forms, and what I really mean by that is that it's not necessarily more of the same kinds of data that we used to have for studying cities. So we don't have a census every year instead of every 10, and we don't have more surveys.
What we have is a lot of data that look and feel very, very different from what we're used to. A lot of those data sets relate either to cities or to activities that happen within cities, so they are of high interest for people interested in the processes that explain how cities work and the mechanisms that underpin those processes. And another thing that's very important for people like me is that a lot of these data sets come georeferenced, which in other words means that, one way or another, either through an address, or lat/lon coordinates, or the unique identifier of the polygon or geographic unit that the data point refers to, we can essentially put them on a map, and that's very good news for people who are interested in the spatial dimensions of cities. And then the final caveat to this statement is that a lot of these data sets have come in what I called, in a paper a couple of years ago, an accidental way. What I mean by this is that these data sets were never really designed for research, and they were never really thought to be either representative of the whole population or to capture exactly the kinds of things you think they capture, and so on. This last bit is actually related to a lot of the challenges that I'll get to in a minute.

But before that, let me give you a sense of which kinds of data I'm really talking about. This comes at three levels: the individual level, the business level and the government level. Let me back up. At the individual level, what we have is data collected from mobile sensors. You can think of smartphones, tablets, smartwatches and similar gadgets that are GPS-enabled, that are connected to the internet, and that can then essentially start pumping out data as the user goes by, even in a passive way.
At the business level, what I really mean is that a lot of the companies and firms that used to do activities completely offline, with no digital counterpart, have taken parts of those activities and made them digital. So for instance, you can think of how shops have gone from keeping their books in books to keeping their books on spreadsheets, and that essentially creates a data set; and if the shop is large enough, that's not a spreadsheet, it is a database sitting on some server, and so on. Or some businesses have built their entire business model, their way of making money, around this idea of generating data and monetizing them. For that you can think of anything from Google, which we all know, to some other examples that are much closer to this field, such as WalkScore, which essentially compiles an index of accessibility and walkability, or Strava, which provides a service for people to record their rides or runs, commutes and so on.

And then the final one is the government sector. In the last 15 to 20 years there has been an increased effort, and it's not clear how long that's going to continue, but what's sure is that in the last few decades governments have been releasing a lot of the data that they were using for internal purposes, such as processing taxes, budgeting, etc. They've been releasing it as open data, and that has opened the door for a lot of analysis in terms of transparency and accountability, but it has also, as a side product and again accidentally in many ways, given researchers who are interested the opportunity to look into aspects that just weren't available before. In the UK, the clearest example that I can think of is the Land Registry, which obviously used to keep track of how much houses sold for, and that was for tax purposes. Now everyone can go to the Land Registry website and check every single transaction that has happened in England. So that's a good example.
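The point made earlier about georeferenced records, that a lat/lon pair or an area identifier is enough to put a record on a map, can be sketched in a few lines. This is only an illustration under stated assumptions: the two rectangular areas, their identifiers and the three records are made up, and a standard ray-casting test stands in for whatever GIS software one would use in practice.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) falls inside the polygon.

    `polygon` is a list of (lon, lat) vertices in order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_area(lon, lat, areas):
    """Return the id of the first area containing the point, or None."""
    for area_id, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return area_id
    return None


# Hypothetical areas (simple rectangles) and hypothetical records
AREAS = {
    "area_a": [(4.88, 52.36), (4.90, 52.36), (4.90, 52.38), (4.88, 52.38)],
    "area_b": [(4.90, 52.36), (4.92, 52.36), (4.92, 52.38), (4.90, 52.38)],
}
records = [
    {"id": 1, "lon": 4.89, "lat": 52.37},
    {"id": 2, "lon": 4.91, "lat": 52.37},
    {"id": 3, "lon": 5.10, "lat": 52.10},  # falls outside both areas
]

# Map each record id to the area it falls in
located = {r["id"]: assign_area(r["lon"], r["lat"], AREAS) for r in records}
print(located)
```

Once every record carries an area identifier like this, it can be counted, averaged or mapped per area, which is what makes georeferenced data useful for spatial questions.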
So what does this mean in terms of the opportunities and challenges that it creates for researchers interested in cities? Let's start with the bright side and the good news, which is that, if you're like me, this is like walking into a candy store. A lot of these data sets are much, much more granular over both space and time, so instead of having a census every ten years, now you have almost real-time data feeds that give you information about some of the activities that take place in cities. In some cases they actually provide a better measurement of certain phenomena. For some phenomena that we're interested in in cities we've been forced to rely on proxies, and a lot of those proxies we don't really need anymore, because we can access the actual process. And all of this high granularity, better measurement and essentially constant collection of data, instead of collection at intervals, has created what I think of almost as an always-on observatory that keeps an updated record of where the city is at. This is really interesting in general if you want to know more about cities, but it's particularly interesting, I think, in the case of the evaluation of interventions. If you think of what happens to a city when a new train station gets built, or a new line gets added to the subway system, with a lot of these always-on data collection exercises all you have to do is let the intervention happen and then look at what the city looked like before and what it looks like after the intervention. Then it's much easier to create evaluations or assessments of what the real effects of certain policies were. So that's the good news.

The bad news is that, because a lot of these data sets come in new forms, as I said before, there's a set of challenges that in most cases we didn't really have to think about in the old days, so to speak. There's a ton of them, and some of them are really
important but I'm not necessarily going to mention them in this context; you can think of privacy, ethics and so on. From the pure point of view of the researcher who's interested in cities and in tapping into some of these new sources of data, I see three main ones.

The first one is the quality of the data in terms of bias, coverage, etc. What I mean by this is that, because these data sets were accidental in the sense that they were never really intended for research, a lot of the quality checks and gatekeeping that statistical agencies have been doing for a long time are missing. If you think of all the effort the census puts into putting out a representative and accurate data set, all of those things essentially go out of the window. If you look at Twitter data, for example, there's nobody at Twitter making sure that the georeferenced tweets are representative of everyone who is in a given city. So a lot of that is just a feature of these data sets; in some contexts it's not very relevant and in some others it can be very, very relevant. It's something that social scientists need to wake up to, this realization that just because this is data, it's not necessarily going to be good data, which is what we were used to in the old days. So that's one, the quality concern.

The second challenge is technical, in the sense that a lot of these data get delivered in forms that we're not used to in urban research. Again, to make the analogy with the census, you can go to the website of the ONS, or whichever provider each country has, the Census Bureau in the US and so on, and click your way through a more or less intuitive website that is designed for you to find data, although sometimes it's hard to believe that, and at the end of the day click and download a nicely crafted file that gives you the information you are looking for. In the case of a lot of these new forms of data this is not the case, again because it was never intended, it
was never designed for research, so researchers have to adapt themselves to whichever way this information is delivered, and that way is technically very different. You can think of application programming interfaces, APIs, as the way that many companies make some of their data available, or of much more archaic and old-school ways; for instance, a lot of the data that governments put out comes in many cases in the form of tables in PDFs. That's just the way it comes, and it's not ideal, but in many cases that's the only way you have. So this basically means that researchers interested in accessing this data have to upskill themselves in terms of computational literacy and, essentially, programming skills.

And the final challenge that I have is a more conceptual one; it's what I call methodological challenges. It's somehow related to the previous one, although the previous one mostly focuses on tools: to put it in a simpler way, in the old days you had to know how to use spreadsheets and maybe some statistical software, while in this new world you basically need to know a little bit of programming to access the data. Methodologically speaking, I think there's also a bit of a shift, in that some of the methods that urban researchers have used and relied on for decades are not necessarily the best fit for this kind of data, again because they were never designed for the shape and structure of these data sets. You can think of how much research has gone on in the econometrics and statistics literature into methods that fit this idea of small data sets, where you have to impose a lot of assumptions because otherwise it's very hard to get to any conclusions. In this new world that's not necessarily the case, and the challenges are different. For instance, instead of how to deal with a lack of data, how do you deal with too much data, which gives you significant results in any context? What do you do to find out what
effects you're really looking for, and so on. And I think what a lot of this is going to translate into in the coming years, and to some extent it's happening already in many contexts, is that we're going to have to go out and borrow from other fields, and maybe even develop new methods. So I don't think that in 5 to 10 years it will be very strange for an empirical researcher who wants to look at cities through the lens of data to have to take courses in machine learning or in computer science departments, and I think that's starting to happen. So those are the opportunities and challenges these data are bringing with them.

Now, raw data per se, just the data records, are not necessarily useful, and this is not a new fact; it's something we've known for a long time in urban research. What we're really interested in is the insights that those data can provide us, and that translation comes through the layer of analytics and statistical analysis. A lot of these new data sets have in themselves created new spins on statistics and data analysis, in what's been called the field of data science: the tools and methods for looking at modern data sets with modern infrastructure, and so on. Now, because a lot of these new forms of urban data are georeferenced, what we are trying to push, and I'm not alone in this call, is that data science by itself is not enough. We have decades of research in GIS, geographic information systems, and we should not forget that, particularly because if you look at data science curricula, they essentially ignore space and geography completely; it's as if data existed in some vacuum and not in geographical space. So what we're proposing is combining these two in what we call, in a more modern term, geographic data science. What we mean by that is the combination of data science approaches and tools to deal with modern data sets with the expertise developed by the GIScience literature, the geographic information
literature, to bring in the power of location and geospatial data, which in some contexts is particular and makes it a different kind of data. So that was essentially the conceptual part of my talk.

Now I'm going to switch gears and show you a couple of examples that I've worked on in the last few years that I think, to some extent, encapsulate most of the ideas I've just talked about reasonably well. The first one is something that eventually made it into a featured graphic in Regional Studies, Regional Science that I ended up calling The Spoken Postcodes, and it's about neighborhoods, and about redefining neighborhood boundaries. Neighborhoods are areas within cities that share the same character; actually, if you go back to the Oxford dictionary, that's the key thing the definition alludes to. It's this idea of character that makes a part of the city a neighborhood in itself: sharing the same character. At the same time, the idea of neighborhoods, or small areas within cities, is a key concept in a lot of public funding and a lot of social science, in the sense that many studies and many public policy decisions rely, one way or another, on the idea of the neighborhood as a meaningful unit of analysis. That's why some policies target certain neighborhoods, and a lot of the urban literature and social science also uses neighborhoods for its analysis.

However, a lot of the available boundaries, or available forms of neighborhood, rely on administrative boundaries. In some cases that's fine and there's no problem with that; in other cases, the processes that the researcher is interested in are different enough that administrative boundaries are not good proxies for what the neighborhood is. So what I did in this piece of work was try to redraw neighborhood boundaries so they better represent this idea of character. And this was just an example, so I also wanted to use a new source of
data to show how to do something that we essentially couldn't have done 10 years ago, because we didn't have this data, and again, same as I said a couple of slides ago, to put everything together through this idea of geographic data science methods. To do that, I took a bunch of georeferenced tweets in the city of Amsterdam, which is the map that you're currently seeing on the screen. Each tweet had a message, and I extracted the language that the message was written in. I then aggregated the tweets by language and by small area, and it's based on the proportions of different languages in every small area that I aggregate areas into more consistent neighborhoods.

To give you a reference, what you're seeing right now is the current postcode map of Amsterdam. What you see is a very neatly structured setup where every postcode is almost as large as every other one, and that's by design, because they need to have the same size and the same, or at least a similar, shape. They're very evenly structured around the city, at least as much as the geography of Amsterdam allows, which is actually not very much. They're very orderly; they're structured almost like a grid. Now, if instead of these you draw on this language information from tweets and aggregate it through a spatial machine learning algorithm, as I did in this case, what you end up with is a map that looks more like this one. It's a map that has the same number of neighborhoods as the previous one, but these are drawn based on the similarity of the mix of languages. If you're familiar with Amsterdam, the city center comes out as the small yellow and big blue blobs in the center, and what those areas are telling us from the analysis is that the mix of languages that you get is very similar within the area and different from the other areas, to kind of put it in
short. And if you look at this map, let me flip back to the original one and then back again, what you get is a very, very different geography of neighborhoods in Amsterdam. The one that has come out of the language data is much messier and much more irregular. There are some polygons that are very, very large; if you look at the bottom right, all of that is one neighborhood, and in the postcodes that was four or five. And there are also very small pockets, outlier areas that are a neighborhood in themselves and that are much smaller than the postcodes we had in the previous map. In a way, if you think about it, that's essentially what cities are: they are messy mixes of people, activities, organizations and firms, everyone goes about their own interests, and the result is equally messy, which is not surprising. And actually, if you know Amsterdam a little bit, and I'm not going to do a very scientific validation of this, a lot of what you think of the city when you look at a map is captured pretty well by the boundaries that this algorithm produces. So that's one example.

The second one is based on the LISA calendar, and this is in another paper that's still under revision. The idea in this context was slightly different: it was more of a methodological challenge, to design new visualization approaches that allow us to make sense of, and exploit as much as possible, large granular data sets about urban activity like the ones I've been talking about in this session. In particular, what we wanted was to identify hot spots of activity in a data set of three years' worth of mobile phone activity in Amsterdam. Essentially, what we had was aggregated usage for every antenna in the city of Amsterdam, and we had hourly measurements for over three years, although there's missing data. What that means is that we have about three hundred antennas, each of which has a catchment area, so you can think of them as polygons, and the whole city of
Amsterdam is split up exhaustively into these areas, and for each of those two hundred something areas we had a measurement of activity for every hour for three years. If you do the math, that's essentially a lot of data points.

Now, if you completely disregard the time dimension of this data set and only look at the spatial one, which is not completely unlike what you would have been able to do with, say, census data every ten years, what you get is something like this. This is the purely atemporal geography of Amsterdam. There is a clear cluster in the city center that captures most of the canals, where a lot of the tourist activity, and also non-tourist daily life activity, goes on, and that's represented on the map in red because it's a cluster of high activity. And then in the north of the city, north of the IJ lake, what you have is a cluster of low activity, residential low-density parts of the city, and that's also picked up by the algorithm. The map you're seeing on the screen at this point is essentially what you could have done with the census; the census doesn't provide cell phone activity, but if you look at population density or employment density it's not very different.

Now, because we actually have very fine-grained temporal information, we can decompress this purely spatial map and turn it into, say, one day of activity, and that's what you see here. What we've done is, for every hour of the day, we've reproduced the map that you see on the left-hand side. What you see is that, sure enough, during the night there's not a whole lot of activity anywhere in the city, and as the day picks up, around 9 or 10 a.m.,
activity starts picking up, the cluster starts to take shape and then it grows, and as the day winds down it shrinks and splits into a couple of sub-centers, and then it dies off around midnight. Now, this is great, and it is already something that you couldn't necessarily have done with the census, because its temporal resolution is very coarse. But if you remember what I said a slide ago, I have three years' worth of data like this, essentially almost a thousand days. 24 maps per day times three years of days is a lot of maps, and it's very, very difficult to make sense of what happens over time. Are there any main patterns that are changing? Are there areas that are popping up as clusters, areas that are winding down? It becomes very difficult to assess this visually if we represent it the way you're seeing now.

So what we did in this project was to come up with a different way of approaching the data and visualizing it, in what we call the calendar, and that looks like what you have on the screen at the minute. This is essentially a color map. It's one calendar per area, so you have as many calendars as areas, but for each area you can look at the results of the cluster analysis and see whether the area was considered a hotspot, whether it was part of a cluster of high activity, low activity and so on. The calendar structures the information in the following way. Along the horizontal axis, what you have is every single day in the data set, and already you can see how this is also a good tool to spot missing data: every white stripe indicates that we don't have data for those days. That's the entire data set, from December 2007 to November 2010. And along the vertical axis it represents the hours of the day, starting at midnight with zero and going all the way to 11 p.m.
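The calendar layout just described, days along the horizontal axis and hours of the day along the vertical one, can be sketched roughly as follows. This is a minimal illustration and not the code behind the paper: the function names, the label values ("high", "none") and the three-day sample are assumptions made up for the example, and the cluster labels are taken as given rather than computed.

```python
from datetime import date, datetime, timedelta


def build_calendar(observations, start, end):
    """Arrange hourly labels for one area into a 24 x n_days grid.

    Rows are hours of the day (0 = midnight, 23 = 11 p.m.), columns are
    days from `start` to `end` inclusive. Cells with no observation stay
    None, which is what would show up as a white stripe.
    """
    n_days = (end - start).days + 1
    grid = [[None] * n_days for _ in range(24)]
    for timestamp, label in observations:
        day_index = (timestamp.date() - start).days
        if 0 <= day_index < n_days:
            grid[timestamp.hour][day_index] = label
    return grid


def missing_days(grid, start):
    """Return the dates for which no hour has any observation."""
    n_days = len(grid[0])
    return [
        start + timedelta(days=d)
        for d in range(n_days)
        if all(grid[h][d] is None for h in range(24))
    ]


# Hypothetical sample: two observed days with a gap in between.
# Labels are illustrative: "high" during the day, "none" otherwise.
start, end = date(2007, 12, 1), date(2007, 12, 3)
observations = []
for day in (date(2007, 12, 1), date(2007, 12, 3)):
    for hour in range(24):
        label = "high" if 9 <= hour <= 22 else "none"
        observations.append(
            (datetime(day.year, day.month, day.day, hour), label)
        )

grid_for_area = build_calendar(observations, start, end)
gaps = missing_days(grid_for_area, start)
print(len(grid_for_area), "hours x", len(grid_for_area[0]), "days")
print("days without data:", gaps)
```

Coloring each cell by its label and drawing the grid with any plotting library would give one calendar per area; keeping missing cells as None is what makes gaps in the data stand out.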
And then, you can't see it because the cells are very fine, but essentially what you have is a grid where the horizontal axis represents the day of the observation and the vertical the hour, and we color each cell depending on whether the area was part of a cluster of high activity, in red; a cluster of low activity, in blue; or what we call a spatial outlier, so an area with low activity close to a cluster of high activity, or the other way around. So that's the way you read this calendar.

This is the one for Leidseplein, which is the epicenter of tourist activity; if you've visited Amsterdam, there's a good chance you've been there. What you see is the profile of the area that you would expect, and one that also remains constant. Basically, this graph is telling us that around 9 a.m. the area becomes a cluster of high activity, and that stays constant pretty much all day long until around 10 or 11 p.m., which is consistent with the rhythm of the area. And, what's also important, throughout the three years that we have, that profile stays constant.

Another example, which you can see on the screen now, is the area of Zuid, where the World Trade Center in Amsterdam is located. What you see is a slightly different profile, because it looks much more jagged and it's not constant. It picks up as an activity center at 10 a.m., it stops at around 5 or 6 p.m., and every so often there seems to be a line, a day, where there is not a lot of activity. This is because there's essentially no residential use or amenities here; all there is is banks, where people go to work around 9 or 10 and go home around 5 p.m.
And that's what it shows. It's also an area where at the weekend there's not a whole lot going on, so that's why every so often you have a stripe that is not colored red, and that gives you this kind of profile of the area.

Now, with these two you could say, well, this is interesting, it's a neat way of looking at the data set, but essentially I could have told you that already, because I know that Leidseplein is the place where all the activity is, and I know that Zuid is a place where people go to work and then go home. Where the usefulness of the calendar becomes much more obvious is in the next couple of examples, where things happen as the data are being collected. What you see in this case is an area in the west of the city, not too far from the back of the Vondelpark, the main green space in the city. Again, think of this idea that I mentioned before of the always-on observatory: always collecting data, whatever happens, so that you can then go back and see what actually happened. What we see here is that, in the course of two or three years, this area started as being part of no cluster. There wasn't a lot of activity, but it wasn't a cluster of low activity either; it was just what you could think of as an average area in terms of mobile phone usage. Then something happened around June 2009, and activity started picking up throughout the day. When I was doing this paper I didn't know this area; I was just flicking through every area and I was surprised, because it's a very particular profile. Soon after, I went back and looked up the area, and there had been a renovation of the main square in the center of this polygon that brought in a shopping mall, a lot of office space and so on, around 2009, 2010. So essentially what you can see is the change of this area, its morphing from an average area into a cluster of activity. And if you were an urban planner who had a lot of
interest in this renovation, you might be interested in seeing whether it has actually had any effect, whether it has attracted more people or it hasn't. And if, looking at this calendar, you thought about collecting this data daily, as you could because it's mobile phone data and if the company was happy with that, you could imagine building this calendar live, as it happens every day, and starting to see these changes in the area, visualizing the city as it's being made, essentially, or almost in real time.

The final example I'm going to run through very quickly. It's another spin on the previous story, in the west of the city. What you have here wasn't actually an average area; it was an area without a lot of activity, but it was close to a cluster, and that's exemplified by this light blue. I didn't explain it before because of time constraints, but essentially light blue is a spatial outlier: it captures an area that doesn't have a lot of activity but is close to a lot of activity. What you see over the period of this data set is that somewhere towards 2009 it starts flicking from light blue to red, and by the end of the data set it's pretty much always red instead of blue. What that means is that this area has gone through the transition from being close to a cluster to being part of the cluster, and essentially that's what a spatial spillover looks like if you put it on a calendar.

So that's pretty much what I had in mind to show you today. Just a couple of take-away messages that I would like you to remember from this session, and then if there are any questions I'm happy to elaborate on anything that I've talked about today. Essentially, I would be happy if you walked away from this webinar realizing that the world of data, the data landscape, has expanded tremendously and is still expanding very, very rapidly, and that a lot of these data that are being added are not
necessarily the same animals that we're used to in urban research. They are better in some respects and not that great in others, and what's at stake here is a balancing act between being aware of the limitations and being able to exploit the benefits. That's why I think it's a huge opportunity for people like me, or hopefully like you, interested in understanding cities or the activities that happen within cities. And then the final one is that the traditional methods we've probably been taught for looking at data on cities are not always the best way to approach the new forms of data; in some cases they might be, but again, these are different animals with different characteristics, so there might be better tools out there to go and learn, or to develop. So, on that note, in case you want to download the PDF, this talk is mostly based on the following references, which are papers that I've written over the last five years; some of them are published, some of them are in the process of trying to be published.

Thank you for presenting today, Dani, and thank you everyone for joining the webinar. Bye.