All right, we are going to go ahead and get started. We now have a talk by Raesetje, who will be telling us about constructing data sets to study the effects of spatial apartheid in South Africa. And I will hand it over to you to get started. Thank you.
Well, thank you. Hey, everyone. My name is Raesetje, and I'm a research fellow at the Distributed AI Research Institute. We all just call it DAIR for short. And in today's talk, I'll be talking about the data sets that we all wish we had but that are so expensive to create, and other ways to think about creating them other than the financial route.
So many of the world's countries are no longer colonized. We come from countries that used to be colonized, and there used to be rules that actually marginalized the indigenous people. But then today, the colonization is no longer the same, and we have democracies, and we have governments governing the people. But we still see the lives of people who were previously marginalized not improving as expected. And it's very difficult to just get evidence of this happening, because it's so expensive to gather that evidence. So how do we think about gathering these data sets together with our communities in order to hold a lot of these governments accountable?
So just to give you some context, I come from South Africa. And in South Africa, there was apartheid, right? And during apartheid, the government passed specific laws that grouped people into spaces. So they removed all non-European groups, and then they moved them into places called townships. And in townships, they were segregated by race, and they were controlled. So they could give them specific budgets. Different townships got specific budgets. Regardless of the number of people in the townships, they would get hospitals and schools that are overpopulated, and they're not improved and not developed as often as they should be given the number of people they're serving. So this was during apartheid.
But apartheid ended over 20 years ago in South Africa, and we still see townships consistently having similar characteristics to the ones they used to have. So this is an example of a township. So this is from 1984 all the way to 2020. And we can see how this township's visual characteristics have not really been changing. So governments are still allocating small pieces of land as if we're still in apartheid. And if you zoom in, hospitals and schools are just overpopulated and under capacity. So how do we think about creating these data sets to hold our democratic governments accountable?
So officially, townships don't exist in South Africa. I mean, they were defined really well during apartheid. But today, it doesn't make sense, right? So if you're trying to find data sets of where townships used to be or where they currently are, you can't find them. So instead, you find data sets like this one here, where they group suburbs and townships together into this new label called formal residential neighborhoods. And given a satellite image like that, you can clearly see the demarcation, right? Like the township is the one above, and then the one below is suburbs. But then the government labels them as one thing, a formal residential area. So now if you're trying to say, in the year 2000, there were four schools here serving this population, the population grew, but there are still four schools here, you can't say whether those schools serve townships or whether they're in suburbs. So that makes it very difficult to hold anyone accountable.
So we created this data set that, firstly, demarcates the extent of all the townships in the country, for us to be able to place other data sets, like your hospitals, schools, parks, whatever, and see how they've been growing or changing as the number of people has been increasing in these neighborhoods. So this is the difference, the main difference, in our first contribution.
So now we know where townships end and begin. And then we can further analyze what's been happening there. So the nice thing about this data set is also that it doesn't just tell you what the land is supposed to be used for, which is what the government actually puts out with this data set. It shows you where people actually live. So it defines townships according to the presence of people, or the visual characteristics of where people actually live.
So to create that data set, we firstly had to see what we had to work with. It's really expensive to create data sets like that, that cover each and every pixel of the entire country. So what we had was the satellite images of South Africa. We had this building data set. So each point here just tells you whether it's a building or not. And just note, this is a wealthy neighborhood, mostly non-white, people of color neighborhood previously. And then this was a township next to it. And just look at the difference in the density of houses between the two. So we also had this data set from the government, but we've also added our labels. So this is what the land is supposed to be used for. So in yellow, that's the township; in light blue, those are the suburbs. And these are the small holdings. These are these neighborhoods, and so on and so on.
So the first step was to convert the data set from the government and add townships to it. And afterwards we said, OK, given that you know where the houses are, we just inflated these points into polygons that cover a standard South African house. And then after that we said, OK, the government has told us what the land is supposed to be used for. Let's just find which houses belong to villages, which ones belong to farms, and so on and so on. And after doing that, because we were going to be using the data set not to find specific people's houses, we just dissolved them so that we can't identify a polygon as someone's house or something like that. And it's also easier to work with computationally.
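The steps just described (inflate building points into house-sized footprints, tag each by the land-use zone it falls in, then dissolve so that individual houses are no longer identifiable) can be sketched roughly as follows. This is only an illustration on a toy grid, not the project's actual pipeline: the rectangular `ZONES`, the `HOUSE_HALF_WIDTH` value, and the function names are all assumptions made for the example, and a real workflow would use proper GIS polygons rather than grid cells.

```python
from collections import defaultdict

# Hypothetical land-use zones as axis-aligned boxes: (xmin, ymin, xmax, ymax, label).
ZONES = [
    (0, 0, 50, 50, "township"),
    (50, 0, 100, 50, "suburb"),
]

# Assumed half-width of a "standard" house footprint, in grid cells.
HOUSE_HALF_WIDTH = 2


def zone_label(x, y, zones):
    """Return the label of the first zone containing (x, y), else 'background'."""
    for xmin, ymin, xmax, ymax, label in zones:
        if xmin <= x < xmax and ymin <= y < ymax:
            return label
    return "background"


def build_label_cells(building_points, zones, half_width=HOUSE_HALF_WIDTH):
    """Inflate points to square footprints, tag them by zone, dissolve per class."""
    cells_by_label = defaultdict(set)
    for x, y in building_points:
        label = zone_label(x, y, zones)
        # Inflate the point into a square footprint of grid cells.
        for dx in range(-half_width, half_width + 1):
            for dy in range(-half_width, half_width + 1):
                # Adding to a per-class set merges overlapping footprints,
                # so no single house can be recovered afterwards.
                cells_by_label[label].add((x + dx, y + dy))
    return dict(cells_by_label)


buildings = [(10, 10), (12, 10), (70, 20)]
labels = build_label_cells(buildings, ZONES)
for name, cells in sorted(labels.items()):
    print(name, len(cells))
```

The result holds one merged region per class instead of one record per building, which is the property the talk relies on both for privacy and for cheaper computation.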
Instead of working with 12 million data points, now we have just 12 polygons that we can just move around and overlay and work with. So finally, the data set looks something like this. This is a piece of Johannesburg, and the labels would look something like that. So all in yellow, that's where the townships are. In light blue, that's where the suburbs are, and so on. In white is the background. So by sharing a data set like this, different domain experts are able to integrate their own data sets, overlay their own data sets, and analyze what's happening, what is where, and who it's serving. So this is the first contribution.
But given how intensive it is to create this first layer of the data set, we wanted to explore ways in which we can automatically reproduce data sets like these, because they were made with labels from the census. So this was in 2011. So how do we make sure that we can continue finding ways to explore how the growth happens within these neighborhoods? So to do that, we had those data sets, and we just created a machine learning model that took in as input the satellite images, which we can get for every year. But then the labels that we had created are only for 2011. So we trained these models, different models, and I'm not going to talk about it. But then we created different models, and then we explored creating something that will take in an image that the model had never seen before. So this is part of South Africa in 2011, and this is that same part in 2017. But there's been lots of growth. So given a satellite image and a model trained with these labels, this is what the results sort of looked like for an image the model had never seen before. So it can actually detect that there's new growth, there are new buildings in that neighborhood that didn't have any before.
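One way to picture how predictions for a later year could be used to decide where to look first is sketched below. It is a toy illustration only, not the method from the talk: the tiny label grids, the `BACKGROUND` code, the `schools` list, and the distance threshold are all made-up assumptions. It compares a 2011 label map with a predicted 2017 map, finds cells that changed from background to built-up, and flags the ones with no facility nearby.

```python
BACKGROUND = 0  # assumed code for "no buildings"; any other value means built-up


def new_growth(earlier, later):
    """Cells that were background in the earlier map and built-up in the later one."""
    return {
        (r, c)
        for r, row in enumerate(later)
        for c, value in enumerate(row)
        if value != BACKGROUND and earlier[r][c] == BACKGROUND
    }


def unserved(growth_cells, facilities, max_distance):
    """Growth cells with no facility within max_distance (Chebyshev distance, in cells)."""
    return {
        (r, c)
        for r, c in growth_cells
        if not any(
            max(abs(r - fr), abs(c - fc)) <= max_distance for fr, fc in facilities
        )
    }


# Made-up example maps: 1 marks a residential class, 0 marks background.
labels_2011 = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
predicted_2017 = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
schools = [(0, 0)]  # hypothetical facility locations as (row, column)

growth = new_growth(labels_2011, predicted_2017)
flagged = unserved(growth, schools, max_distance=2)
print(sorted(growth))
print(sorted(flagged))
```

Anything in `flagged` is only a candidate for further investigation; as the speaker stresses later, the model output still needs close evaluation before anyone acts on it.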
So if there is something like this, if there's a change in the pixels, and there are no schools or hospitals or any other resources in this neighborhood where people live, then it's a problem, right? Then it's for further investigation. So instead of thinking about the entire country and looking at each and every part of it, now we have a way to, firstly, find what to focus on first.
So some advantages of the data set that we have: the government data looks something like this. So imagine you're trying to train models to be able to reproduce visualizations that look like your training data. If you have a patch like that, it's really difficult for the model, or anyone really, to say that this is a township given what the government would have given you, right? But with the data set that we've defined, we are able to characterize townships or suburbs or anything really by the presence of buildings. And then we can characterize those buildings and then build our models accordingly.
So some issues with the way we do it, well, some issues generally with creating data sets about things like this, is that a country is quite big, and suburbs in one region don't look the same as suburbs in other regions. So these are just issues that are inevitable, unfortunately. And you can find different ways to deal with them. So everything enclosed in this blue boundary is actually a suburb. But some suburbs have big yards and others have smaller yards. But actually it's really difficult to confuse suburbs with townships, and it's also difficult to confuse suburbs with informal settlements. But it's easy to confuse townships with informal settlements. So what we also explored was not just having 12 classes but having four classes.
So some wealthy neighborhoods, so those are your suburbs, your small holdings; and then some non-wealthy neighborhoods, like your informal settlements, your townships and so on; then non-residential neighborhoods, commercial areas, industrial zones and places like that. And we just did this division by the cost of real estate, and having labels like this created models that generalize better.
But then it's not just issues with the data set, it's also issues with the models that you're using. Some models, like this one, were really sensitive to noise. So seeing a footpath like this, it automatically assumed that this is probably a non-wealthy neighborhood. It should be here somewhere. But some models are actually quite robust to those kinds of things. So there are a lot of things to consider when you're doing work like this.
So another thing to consider is that evaluation is key. I come from a township, so when I was building the models, I could see where the model was getting it wrong and when the labels were wrong. I started with the government data set and then I was like, okay, it's confusing this and that. So can we relabel the data when we need to? It's a very expensive process to decide to relabel the data. But then I know that if we had created that model, it wouldn't even be reflective of anything that we're trying to get to. So evaluating with someone with lived experience is always really helpful. Working with domain experts as well can help you mitigate or get over small technical mistakes that could be really expensive if you don't take them into account.
The final point is that usually when you're working with data sets of people, they can be really emotionally taxing. So you're looking at images from 2011. There are houses by a river bank, an informal settlement that formed, and then maybe in the next year, those houses are gone. So you're asking yourself what could have happened there?
You can actually slip into a very deep emotional trauma just looking at data sets like that, because we're dealing with human data and this could be any one of us, because we are working with our own communities. Thank you.
Thank you for that great talk. I really enjoyed it. We do have time for questions. We have about five minutes. Who would like to get us started?
Hi, awesome project. I want to understand what the use cases are for the data after the collection, which was very thoroughly covered.
Yeah, I mean, depending on who is looking at it, there could be a lot. There are actually NGOs we are considering working with in this next leg of the project. So there are NGOs in Cape Town, for example, who find sales of land that are meant for private buyers and oppose them with the government, so that the land is converted to sale for middle- to low-income people. So it's really useful for them to find empty plots in the country, so that they can see who owns them and whether they're being offered for sale, or they can actually go to court and try to get them for sale somehow. So that's a use case. But then there are other people who are trying to find informal settlements that are popping up, so trying to mobilize clinics and see if the government would put schools and things like that there if there are kids there, and what's going on, especially when natural disasters happen, whether people over there can remobilize. It's actually a lot of people, different domain experts, who could be interested in a data set like this. Usually when they're created, they're not at a country scale, which is a nice advantage of this one.
I have a question. So you mentioned that the model gets confused around some things like footpaths. Where else does that happen? What else does it miss?
Yeah, actually at the beginning we also had boundaries around the land, like the whole real estate, and there were a lot of issues there because of the empty spots.
So in the next iteration of the project, we had to find buildings exactly, because those are the characteristics that we're looking for. But we used to have problems with that. The empty space within the real estate, that used to be a very big problem. Mountains too, because some trees look like people are living there. Also over time, the satellite images got better, right? So the resolution is no longer the same. So there were some issues with using the same model on an image that is of higher resolution. So we had to actually find ways around that as well. So yeah, you have to closely evaluate what's going on before you give it to anyone, really.
So did you use a pre-trained model with your dataset, and which pre-trained model have you used? Have you tried different models to see the output? What models have you tried, deep learning models?
We actually didn't use any pre-trained models. All the models are trained from scratch, because we have the data; it's about 10 million images. So we can actually afford to not use a pre-trained model. So yeah, we can just go deeper and add more images.
We have time for a quick one.
Okay, thanks for that, super interesting. The results, okay, so I was able to interpret the maps and all of that. But have you thought about how you can disseminate this information to everyday citizens? Because if they look at those maps and satellite imagery, they won't be able to interpret it. And it's something super important that they need to know about.
Yeah, actually we are in the process of creating a visualization tool that people can just easily use and also give feedback on. Actually, we're creating it with Dylan. So that's going to be something that does not even require you to have a computer or know about GIS formats or anything. So it'll just be on the web and you can just play around with it.
And then also we've been talking to the press and trying to get it through to the normal person listening to the radio and reading newspapers and things like that. But then also in the next leg we are contacting NGOs and other community organizers to let them know about this data set that we're creating for them, and actually with them as well, some of them. And yeah, we're open to learning about other people who would be interested in working with us on it, because the data set is created for the people, right? I'm also from a township. So I was really interested in having a data set like this, but so are policy makers and other people like that.
All right, let's get another round of applause.