What I'm going to be talking about is how you can get insights by joining two maps. But before we go there, just some basic bookkeeping. In case you're tweeting, these are the hashtags: you probably want to be using the PyCon India hashtag, and my ID is sanand0. You don't need to worry about the slides; they are online. I've already posted on Twitter the link to this slide deck. But if you desperately do want to take notes, then one small suggestion: research has shown that taking notes with pen and paper is much better than taking notes on laptops or on mobile phones if you want to remember stuff. This was a discovery for me. In fact, it was my discovery of the year, and I'm following it diligently. Do give it a shot if you want to take notes.
Let's dive in. The story begins at the Karnataka elections in 2018. About an eighth of the voters are Muslim, and both the Congress and the Janata Dal, the JDS, were trying to get their support, while on the other hand the BJP was taking potshots, saying both of them were just trying to appease the community. The Hindu newspaper wanted to write a piece about how large a factor this is and where the Muslim vote is strong.
Here we have a problem. The proportion of population by religion is available only at the district level or the village level, depending on where you get the data from, and this is from the census. Unfortunately, elections are not conducted by district; elections are conducted by constituency, and these are two very different maps. So I have data on one map which shows me how many Muslims live in a particular region, and I want to see how many Muslims live in a different region on another map. Even though they overlap, there really is no direct way of getting the data from one layer onto the other. So we literally don't know how many Muslims live in a constituency. So how do we solve this problem?
Well, the logical way is this. You could take one district and a constituency, or a set of constituencies. Let's say the district has a population of 100, out of which we know that 13 percent are Muslims, and we want to split it evenly across a bunch of constituencies. We could just overlay them. One district could cover multiple constituencies; one constituency could cover multiple districts. There is a many-to-many mapping between these, with sometimes full coverage and sometimes partial coverage.
So this district, for instance, covers at least one constituency fully, and maybe this takes up about one third of the total area. So I can say approximately one third of that district's population, which is that red area, lives in this constituency. Or let's take another constituency which overlaps with this district. Now only a portion of this constituency overlaps with this district. This area takes up maybe about one fifth, or about 20 percent, of the district, so I can say that about 20 percent of the district's population lives here. In other words, we are simply making an assumption that within a district, which is the lowest level of data that you have (or the village, if you have village data, which is far more granular), the population is uniformly distributed. That's the basic assumption.
Now, what we can do is fragment each of these districts and constituencies by overlaying them, creating the intersections, and reassembling those. This is a process that I call reshaping the map.
How much of this can we do in Python? Can you switch over to this side? Okay. There is a library called reshaper that we put together. The reshaper library is very much a work in progress, by the way. You can find it at github.com/gramener/reshaper. It does exactly what I'm about to show you right now. So let's give it a shot. I'm going to open up the IPython notebook.
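
The area-weighted split described above can be sketched in plain Python. This is only an illustration under stated assumptions: axis-aligned rectangles stand in for real boundaries, the names `Rect`, `overlap_area` and `reshape` are made up for this sketch, and it is not the reshaper library's code. It assumes, as the talk does, that population is spread uniformly within a district.

```python
from collections import defaultdict
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a real boundary polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0 if they do not overlap)."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(width, 0.0) * max(height, 0.0)


def reshape(districts, constituencies, metric):
    """Move a district-level count onto constituencies, weighted by area share."""
    result = defaultdict(float)
    for d_name, d_shape in districts.items():
        for c_name, c_shape in constituencies.items():
            # Fraction of the district that falls inside this constituency.
            share = overlap_area(d_shape, c_shape) / d_shape.area
            if share > 0:
                result[c_name] += metric[d_name] * share
    return dict(result)


# Hypothetical toy data: one district of 100 people, 13 of them Muslim.
districts = {"D1": Rect(0, 0, 3, 1)}
constituencies = {
    "C1": Rect(0, 0, 1, 1),  # lies fully inside D1 and covers 1/3 of it
    "C2": Rect(1, 0, 5, 1),  # covers the other 2/3 of D1 and extends beyond it
}
muslim_population = {"D1": 13.0}

estimate = reshape(districts, constituencies, muslim_population)
```

Note that this apportions counts. A percentage cannot be split by area in the same way; it would first have to be converted to a count, moved across, and then divided by the reshaped total population. On real boundaries, geopandas' `overlay` function with `how="intersection"` is one way to produce the fragment polygons whose areas play the role of `overlap_area` here.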
The core library that we are going to be using for this is geopandas. For those of you who have been working with data, pandas is pretty much the de facto library to use for any kind of data processing. Geopandas is becoming that kind of a standard for shapefiles. So if you have any shapefile and you want to do any kind of geospatial processing, an easy way of doing it is geopandas, and an easy way of installing it is through conda, using Anaconda, rather than trying to do a pip install by yourself. It's a little more efficient on most machines.
So let's import geopandas. Now, I have a shapefile which has the Karnataka census data, which will eventually appear. I'm going to load it once it appears on the screen. It's just taking a long time. Okay, there we are, back again. So gpd, which is the abbreviation for geopandas, has a from_file function that lets you load any shapefile. Now, the other question you'll have is: where am I going to get these shapefiles from? We'll come to that in a bit. It's not as difficult as you might think.
Let's say you've downloaded the shapefiles. This particular one is the Karnataka census shapefile. What does this look like? Geopandas has a plotting function which lets you see what the map looks like. So if you look at these districts, this is a pretty large district, this is Bidar, and so on.
Well, let's take the area for these. Geopandas offers an attribute called .geometry, which has an attribute called .area, which gets you the overall area of each of these regions. And if you want to look at what that data frame looks like, each of these regions corresponds to one row. So Bagalkot district is one row, Bangalore Rural district is one row, and so on. All of the data in the shapefile also comes in here.
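
To make the area figure less of a black box: for planar coordinates, the area of a simple polygon can be computed from its vertices with the shoelace formula. The sketch below is plain Python with made-up shapes and a hypothetical helper name (`polygon_area`); it is not geopandas code, and it only illustrates the kind of number that an area attribute returns for each row.

```python
def polygon_area(vertices):
    """Planar area of a simple polygon from its (x, y) vertices (shoelace formula)."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the ring
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical shapes in arbitrary planar units, not real district boundaries.
regions = {
    "square": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "triangle": [(0, 0), (4, 0), (0, 3)],
}
areas = {name: polygon_area(points) for name, points in regions.items()}
```

One caveat worth keeping in mind: if a shapefile stores latitude and longitude, an area computed this way is in squared degrees rather than square kilometres, so it is useful mainly for comparing regions and computing shares, not as a physical measurement.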
You have a column called geometry, which has the additional geometry details. This is a pretty large column; you probably won't be going into the details of it. And we've just now added one column called area, which is out here, and this has the area of each of these regions. At the very least, you can figure out which are the larger regions and which are the smaller regions.
Let's do the same for the constituencies dataset. So here we have the constituencies. There are more. These are parliamentary constituencies, by the way, not assembly constituencies. The difference is that if you're electing someone for the parliament, an MP, then it's a parliamentary constituency; if you're electing them for the assembly, an MLA, then it's an assembly constituency. Parliamentary constituencies are bigger. You'll notice that out here there are multiple parliamentary constituencies that sit in the same region that this district sits in, but it's not a perfect match. Again, let's take the area and see what this looks like. We have a bunch of these parliamentary constituencies, like Gulbarga, Bijapur, etc., and their respective areas.
Now, geopandas has a function called sjoin, which lets you take two shapefiles and create all of the intersections between them: the fragments that I just showed you out here. Creating all of these fragments is what the sjoin function does. So if we do that, what it's now done is create a new data frame called merged, and that has all of these shapes. Let's validate that. There are 30 districts and 28 constituencies, but when you overlay them, it turns out that there are 147 fragments, each of which represents an intersection of a district and a constituency.
Now, given this, it should be possible to just take any metric, like the percentage of Muslim voters or the size of the Muslim population, from the district data into the data that you have on the constituencies. But it turns out that it's a little trickier than
that, so you have to do a little more calculation, and that's what's available in the reshaper library. You can take a look at the code. What it does is move the metrics from one layer to another in a way that is seamless. So ultimately all of the... Okay. No, it was not the function; the file is merge.py, and this moves all the metrics from one layer to another.
Once we have this, the final result is an Excel sheet that kind of looks like this. It has all of the attributes from both the layers. So it says, for instance, that this particular assembly constituency is broken up into three regions, each of which maps to a different district. Some of it overlaps with Mon, some of it with Sivasagar, some of it with Tirap. In fact, these are across different states. And it has the area of each of these, along with a variety of other metrics that you can calculate, and the proportion of area that is overlapping.
Once you have this kind of a dataset, what can we do with it? Let's go back to our story: what actually happened with the Muslim vote? Well, this is the constituency-wise Muslim voter population in Karnataka. This was used by The Hindu to publish an article on where exactly the bulk of the voters are concentrated. So there is a chunk here. There's a chunk here.
There's a chunk here. Now, what was happening at this particular point was a fight for an alliance. The AIMIM, which is a Muslim party whose name is very long and I can't even say it fully, had won a number of seats in Telangana and was looking to also participate in the Karnataka elections. They had planned to contest in 60 seats. Now, to make sure that they got the Muslim vote, both the Janata Dal and the Congress were vying for an alliance with the party, and in April 2018 AIMIM decided that they would not be directly contesting in the elections but instead would be supporting the Janata Dal.
Now we have the results of the elections by constituency, and we know the voter population by constituency. Let us see what happened to the Janata Dal. It turns out that where there was a larger Muslim population, the Janata Dal actually got fewer votes. So you can see the net result of this election and the alliance. The Congress, on the other hand, had a mildly higher vote share, and where there was a significantly larger voter population, it turns out that the BJP was the one that gained the most. Now, while I'm moderately okay at Python, I'm terrible at electoral analysis, so I have no idea what this means. Okay, I'll let you figure it out. The elections in Maharashtra and Haryana are also coming up, and it turns out that Congress is allying with AIMIM, and let's just leave it at that.
So what can we do with this? What kinds of datasets exist, and what is the potential of being able to join datasets across two spaces? That's something that I'm pretty keen on. It turns out that in India there are broadly three kinds of geographic hierarchies.
There is a political boundary hierarchy, a postal boundary hierarchy, and an administrative boundary hierarchy.
By political boundary, I mean the state, parliamentary constituency, and assembly constituency, going all the way down to the polling booth. This has all of the results of the elections, and one of the important aspects of this is that policies get made to a good extent at this level, because the MPs and the MLAs are focused on their respective constituencies.
The second is the postal code boundary. There is a zone, within which there's a sorting district, within which there is a post office, and there is a pin code. There are about a hundred and ten thousand of these in total.
The third is the administrative boundary hierarchy. There is a state; there is a district. There's also something called a division, but we'll leave that aside. Then there could be a sub-district, block, or village if it's a rural area, or it could be a municipality, zone, and ward if it's a township.
Now, this apart, there is one other way: we can create our own hierarchies. By the way, before that, in case you're looking for shapefiles for many of these, the easiest way to get shapefiles for India is to search for DataMeet maps. DataMeet is a group, a discussion forum, and there is a lot of active discussion on various kinds of maps. For pretty much any kind of map, there's a decent chance that you'll find it on DataMeet. If it's not there on DataMeet, ask the people; they might be able to post something. And if not, it probably just doesn't exist.
But you can also create your own boundaries. If you have a single location, you can look at the area that is closer to this particular location than to any other location. So, for example, if this were a network of, let's say, schools, then what is the region that is closer to a particular school than to any other school?
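
The "closest school" question can be answered by brute force: label every cell of a grid with its nearest site. The resulting partition is what is usually called a Voronoi diagram. The sketch below is plain Python with hypothetical school coordinates and made-up function names (`nearest_site`, `rasterized_cells`); it is a rasterized approximation for illustration, not how QGIS or the reshaper library builds the polygons.

```python
import math


def nearest_site(point, sites):
    """Return the name of the site closest to `point` (straight-line distance)."""
    return min(sites, key=lambda name: math.dist(point, sites[name]))


def rasterized_cells(sites, width, height):
    """Label the centre of every cell in a width x height grid with its nearest site."""
    return [
        [nearest_site((x + 0.5, y + 0.5), sites) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical school locations on a 6 x 4 grid.
schools = {"A": (1.0, 1.0), "B": (5.0, 1.0), "C": (3.0, 3.5)}
grid = rasterized_cells(schools, width=6, height=4)

for row in reversed(grid):  # print with y increasing upwards
    print("".join(row))
```

Each block of identical letters in the printout approximates the region belonging to one school. Any per-school figure, such as student counts, can then be attached to that region like any other boundary-level data.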
So if I take this particular point as a school, then this red region represents all of those points which are closer to this school than to any other school. This particular process is called a Voronoi tessellation, and it is something that comes out of the box with QGIS. It's something that you can also create with a command-line program, again using the reshaper library. But what that means is that now you can take literally any point and convert that into a region, and the potential for that is quite high.
So look at the kinds of datasets that you can create with location boundaries. Take all the hospitals, all the schools, all the bank branches, all the petrol pumps, all the locations where crimes have been reported, any address, all the telephone towers, all the locations where there are stores of a particular brand. All of these are datasets for which you can get an address, and an address can be geocoded into a point. If it can be geocoded into a point, you can convert that into a region, and for each of these you naturally have some kind of data. For schools, you know how many teachers or how many students there are. For telecom towers, you know which organization runs that tower, and potentially the telecom organization will know how many calls are flowing through it. If it's healthcare data, you know how many facilities that hospital has, how many patients, how many doctors. All of these are datasets that can be added to that particular cell in your respective region.
What this means, therefore, is that we can take any of these datasets, which you can create from location boundaries, or the ones that often already exist by administrative boundaries. And this is a pretty powerful set as well. The census gives us demographic data; asset ownership (who owns laptops, internet connections, TVs, cars, fridges); social and religious data; economic indicators: wealth, income; household indicators: is the house made of a
mud roof or a brick roof? Do you have a toilet in the house or not? Practically every government scheme is tracked this way. So how many people have benefited from the National Rural Employment Guarantee Act? Banking data is reported this way; health data is reported this way. So effectively anything that the government runs is reported by administrative boundaries, and anything that the corporate sector runs is, by and large, reported by locations. So between these two there is enormous potential.
But there's also the fact of how decision-making happens. Ultimately, political boundaries are owned, in some sense, by an MP or an MLA, and of course there are also the IAS equivalents, who usually run things by administrative boundaries. So if I wanted somebody on the political side to make decisions, I could take any of this data and put it onto the constituency boundaries. If I wanted an administrative official to make a decision, I could take any of this data and put it onto a district. If I wanted a manager, or the principal of a school, or the CEO of a hospital to make a decision, I could take all of that data and put it onto their geographic boundary.
For example, one of the things that The Hindu found, again, was that the Congress was doing much better in the agrarian areas, and they did that by taking the census data, which had the percentage of farmers, and mapping that onto the voter constituency regions. If we took, for example, census demographic data and school data, we could answer the question: where should we open new schools so that students don't have to travel far, or so that there is a reasonably equal distribution of students across schools? If we took economic indicators (how well the country is growing) versus bank branch data, then we could answer questions like: are the bank branches distributed based on population, that is, does every person have roughly equal access to the bank? Or based on wealth: does every rupee have roughly equal
access to the bank? Or, if it's in between, how close is it to one or the other? We could find out whether increasing a district's wealth leads to more theft. People get richer, so does that lead to an increase in crime, or does it lead to less theft because the people are richer and don't need to steal? These are datasets that are available and can be joined. Similarly with health data: does poor health lead to an increase in the number of pharmacies that are set up in that region, because the pharmacies can sell more? Or vice versa: if you actually set up more pharmacies, does that have a positive impact on people's health in that particular region?
Now, the reason these questions are trivial to ask but nearly impossible to answer today is that merging the data across different kinds of map layers is non-trivial. But both conceptually and technologically, it is quite an easy exercise.
What can we do to solve problems like these? Well, me personally, I'd love to see more of these hidden insights come out. But there are a few things that you can do literally right now. First, if you have an idea, take a look at these datasets, any of the datasets that you know, and raise an issue on this particular PyCon repository. And I'd invite all of you to share this with people.
It'll be great to see what kinds of problems can be solved using these ideas, and I'd like to crowdsource this to a number of people on the administrative side, on the NGO side, and on the corporate side, to create a repository that says: here are things that we can do.
If you want to try solving one of these and discovering your own insight, to build your own portfolio or to share some useful knowledge, then start by finding a map. Like I said, DataMeet is a good place where you can find a map. You can find the reshaper library at github.com/gramener/reshaper. The links are again on GitHub under gramener, in the PyCon 2019 repository; this is the one link that you need to remember. And if you find something, do share it on Twitter. Please tag me; my ID is sanand0. I'd love to share it, at least with the media, and get some people to understand the power of geospatial joins.
If you want to contribute to the library (right now it's in a terrible state), or if you want to learn more, I'm planning to organize a series of workshops on geospatial joins. Do drop me an email. My email ID is s.anand@gramener.com, and I'll mail you about the workshops. If nothing else, if you just enjoyed the talk and you've learned something from it, then tweet about it. The tag is PyCon India 2019; my ID is sanand0. More than anything else, I'd love to see some insights come out by joining data. Happy mapping!
So basically, my question is this. I belong to the north-eastern part of India.
I'm from Assam. So what happens is that the documentation for this geographic data is always kept in a sort of register, we call them registers or something, which are all in a... So how do we use image processing and the like to bring those things into more of a public space?
Okay. There are broadly three ways in which you can get this kind of data out. The first is beg, borrow, steal. Somebody in the government may have this data. For example, if you go to the Survey of India, they sell these shapefiles. Of course, I've been trying to buy one of these shapefiles for the last six years now, and I've failed. I've tried it through the Prime Minister's Office, and I still failed. But it's actually easier to just walk over to the Survey of India office in Dehradun and give them a USB stick. So depending on how you approach it, it may prove relatively straightforward.
On the other hand, sometimes the maps don't exist. For example, the most interesting anecdote was that the former head of the postal college of Mysore was trying to create a postal map, the regions of all of the pin codes. It turns out that nobody knows what region a pin code covers. So he created that, uploaded it into ISRO's Bhuvan, and then after about a year and a half realized that people have permission to upload into ISRO's Bhuvan but not download from it. So after one and a half years of putting in all the data, the data is locked. It's not even there. So today, what is the best source of pin code data? It turns out that what people did was take various locations and geocode them. They said: this location is at this particular pin code, that location is at that particular pin code, so let's draw a region around each using the concept of Voronoi polygons, and publish it. So the second possibility is to create such maps.
The third possibility, the one that you talked about, is: can we use image processing to detect it?
Some features can be detected that way. For example, if you want to detect urban regions or constructed regions, that's possible using satellite photography. Or if you want to locate water bodies and see whether they are growing or shrinking: for example, in Chennai, is the Chembarambakkam Lake actually drying up? That's something that you can draw a boundary around using image processing, and that's a straightforward method. But the thing is, I don't think a single method will work for the wide variety of datasets, which is why we have many of these. The biggest lesson that I've learned is that 90% of the things that we want, somebody else has usually wanted and has managed to get. So I find that the most efficient way is to ask, and DataMeet is a pretty good place to ask if somebody already has this data.
Thanks a lot.
Anand, thank you very much for your talk. I've got a question regarding shapefiles. I had the requirement of using the map of India a few times, and I suddenly realized that our external boundaries are in dispute in a lot of places, and the shapefiles that we get do not match what, politically, we want our boundaries to be. So is there any official place from where we can get these shapefiles? Because the only shapefiles which are available are those distorted shapefiles, and I finally had to change the shapefiles myself to use them. I couldn't find any official place from where to get the shapefiles.
So the official place is the Survey of India, which claims to sell these maps. Like I said, for the last five or six years now I've been trying to buy these maps. It's actually not possible, but there are people who have succeeded, and... Okay, this talk is being recorded, right? Okay. Let's just say that if you go to DataMeet maps, you will get unofficial but correct maps.
That's what... So, since you are in this field, shouldn't we have a system for getting correct official maps? Isn't there a process being put in place or something?
I tried talking to a couple of people at the Prime Minister's Office and suggested this. They put me on the phone with the Inspector General of Surveys, or some such high-ranking official, who said, "Yes, absolutely," and connected me to some person, who connected me to some person, who connected me to some person, who was exactly the same person I had talked to in the first place. So I don't know. I'm sure there is a process. I don't know it well enough.