We're going to be talking about predicting urban heat islands in Calgary, a city in Canada. Let me first do a bit of an introduction. There are two of us giving this talk. The first is myself: I'm Anand, CEO and co-founder at Gramener, a data science company. Along with me is Suneeth, a senior data scientist, who will take over the second half of the talk, where we explain exactly what we did. But let me first give you some context: what is the need to predict heat islands, how can Python help with that, and what did we do in that space?

A heat island is when a portion of a city is much warmer than the nearby areas. Urban areas typically have higher heat absorption and higher heat retention; that's one of the reasons they get hotter. There is also transpiration from plants, which cools the environment, and water evaporation from the soil, which cools the environment, both of which are absent where there is more concrete on the ground. Water penetration is also impacted, so there is a whole lot less water in the surface soil than in rural areas. The net impact is that an urban area is a whole lot hotter than a rural area.

This is something that can literally cause deaths. In 2021, Calgary estimated that there were as many as 66 deaths directly traceable to heat waves and urban heat, and this is not uncommon; the number has been increasing over the years in the city of Calgary. Heat waves are one part of the problem. Human health and safety is another part, and public services in general get disrupted. What the city wanted to understand is what is contributing to this, and what interventions can be made to reduce the problem.

Now, what can they do? One thing they can do is grow more trees and vegetation, which lowers the air temperature through shade and evaporation; it also reduces the amount of stormwater runoff, so there is more water underground. Green roofs help: they reduce both the roof and air temperature, so the surface is cool as well as the air. Cool roofs as well; cool roofs differ from green roofs in that they are more reflective, so they take heat away from the building, which reduces the temperature below the roof. Pavements can also be designed as cool pavements that reflect more and allow more evaporation. All of these also improve the water below the surface. Beyond that, the government has a series of options to spend the budget in smarter ways and improve the overall strategy.

But the thing is, every single one of these requires a budget and a certain amount of effort, and doing this for the entire city is actually quite hard. So their question was really this: if you could move to a 100 meter by 100 meter granularity, where exactly is the highest risk, so that we can take more focused action? And this means that the prediction needs to be at a whole lot more granular level than just taking the overall satellite data. Now, the good part is that the data is up there. Landsat 8, and we'll talk a little more about it, provides land surface temperature across the globe every 16 days. Landsat 8 is a satellite that has a series of sensors, and one of those sensors is a thermal sensor.
Now what we can do is look not just at the temperature data, but at data about the buildings, about the vegetation, and about a whole series of other things at that granularity, and use that to predict whether the temperature in a particular region is going to rise or fall in the next few years, effectively doing the equivalent of a regression, except doing it spatially. This is in fact what we did, and we were able to explain as much as 70% of the variability in the Landsat land surface temperature. This talk is really about how we went about doing that. First, I'm going to explain the kind of data that we got, and then I'll pass over to Suneeth for the second half of the presentation, to talk about how exactly we modelled this.

So the first part is understanding where exactly the heat islands are, for which we need this temperature data, and what exactly is causing them. Landsat 8 provides temperature data through a thermal infrared sensor. This can be used to map the temperature of any particular region at a fairly fine level, 100 meters by 100 meters. For example, this is what it looks like in August 2014: the land surface temperature is not too bad, and you can see that it's relatively cool at this point. However, in 2019, you can see that it's a whole lot warmer. In just five years, the temperature has increased by as much as 6 to 7 degrees, which is quite a bit. If you look at the temperature changes that have happened year on year, from 2014 to 2015 there was a reasonably big shift, and from 2015 to 2016 just as big a shift; from there on, it has been fairly stable. However, it's not as if all of the regions have been heating up uniformly. In fact, from 2018 to 2019, you can see that some regions near the river actually cooled down a fair bit. Now, it's important to understand what exactly happened as part of these local changes, which the government can make use of, and also to see which of the variables have an impact.

Here are some of the variables that we have in the environment. Specifically, we can use the normalized difference vegetation index (NDVI), which effectively tells us how much vegetation coverage there is in any particular region. The second is surface water bodies: how much water is there in a particular region? That's a factor that can affect the temperature. The third is albedo: how much solar energy does the surface reflect? These are things we can extract from the satellite imagery. But there are some pieces of information we can't quite extract from there, like the population, for which we need to provide external data. Then there is also information such as the normalized difference built-up index (NDBI), which captures the areas where there is more construction versus less construction, effectively the urban infrastructure. Also the urban sprawl, that is, how dispersed the buildings are: are all the buildings clustered together, which has different temperature characteristics, or are they spread out more? Then the building count, which is the total number of buildings in an area, the site coverage, and a whole series of such parameters. These are the parameters that we used to build the model. And what we're going to do now, and specifically I'll hand over to Suneeth for this, is show how exactly we modelled this particular scenario. Let me stop sharing my screen.
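As a rough illustration of the index calculations described above, here is a minimal sketch of computing NDVI and NDBI from Landsat 8 band rasters with rasterio. The file names are hypothetical stand-ins, not the actual repository files:

```python
# Sketch: NDVI and NDBI from Landsat 8 band rasters.
# Band 4 = red, band 5 = NIR, band 6 = SWIR-1. Paths are hypothetical.
import numpy as np
import rasterio

def read_band(path):
    with rasterio.open(path) as src:
        return src.read(1).astype("float64"), src.profile

red, profile = read_band("LC08_B4_calgary.tif")
nir, _ = read_band("LC08_B5_calgary.tif")
swir1, _ = read_band("LC08_B6_calgary.tif")

eps = 1e-10  # guard against division by zero over nodata pixels
ndvi = (nir - red) / (nir + red + eps)      # vegetation cover, range -1..1
ndbi = (swir1 - nir) / (swir1 + nir + eps)  # built-up index, range -1..1

# Save NDVI with the same georeferencing as the input band
profile.update(dtype="float64", count=1)
with rasterio.open("calgary_ndvi.tif", "w", **profile) as dst:
    dst.write(ndvi, 1)
```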
For the next part, we have Suneeth on. Suneeth, could you share your screen? Yes. Can you see the screen? Yes, it's coming up. So, the next part, the major part, is how we modelled it. The entire model was built with Python libraries, specifically geospatial libraries. This was the general flow; I will walk you through the different sections that the code takes us through.

First, what are the different datasets we are using? One, as Anand mentioned, is the Landsat 30-meter dataset. Landsat provides us with thermal bands, which are very good for identifying the thermal temperature of a particular object. The other dataset is the infrastructure dataset, which relates to building infrastructure: how many buildings are there, what is the height of each building, what is the parcel area of the building? The third is demographic data, like the population in each community in the Calgary area.

We did three major steps: data processing, data aggregation, and feature engineering. These are the most common steps for any machine learning or spatial machine learning workflow. In data processing, the raw dataset was not enough, so we calculated different indices from the satellite imagery. We also calculated urban morphological variables, which include, say, floor area ratio and building block coverage; all of these metrics will come up later when I walk you through the code. The data was then aggregated at a grid level. We wanted information at a 100 meter grid, so we used zonal statistics to aggregate all of these datasets onto one grid, so that each grid cell has the temperature value, the NDVI value, the building block coverage, and the other building-related values. Then some feature engineering: after doing all of this, we found out which features affect the land surface temperature most, and kept only those.

The main method used for prediction of land surface temperature was spatial regression, again performed at the 100 meter grid level. The type of spatial regression we used is spatial regression with fixed effects. At the bottom you will find the particular data processing steps we did. For Landsat, the data was available at a particular timestamp, and we resampled it to the 100 meter grid while doing zonal statistics and aggregation. Zonal statistics is a method in which you assign a zone to an entity, and it aggregates information about all the entities falling in that zone. And third is spatial modeling. As I said, Landsat 8 data is available from 2013, but unfortunately we didn't get good cloud-free coverage, so we took data from 2014 to 2020. Landsat data is available roughly every 16 days.

Moving on to the next slide. I will not go into detail about how we calculate the land surface temperature; these formulas are derived from the Landsat Handbook, which is on the USGS website. We have also mentioned these sources in the Git repository, which is linked in the chat box.
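As a minimal sketch of the zonal statistics step described above, here is how a raster such as land surface temperature could be aggregated onto a grid of polygons using the rasterstats library. The grid file and raster name are hypothetical stand-ins, and the grid is assumed to be in the same CRS as the raster:

```python
# Sketch: zonal statistics aggregating a raster onto a 100 m grid.
import geopandas as gpd
from rasterstats import zonal_stats

grid = gpd.read_file("calgary_grid_100m.geojson")  # one polygon per grid cell

# Mean raster value per grid cell; grid must share the raster's CRS
stats = zonal_stats(grid.geometry, "calgary_lst_2019.tif", stats=["mean"])
grid["lst_mean"] = [s["mean"] for s in stats]

grid.to_file("calgary_grid_lst.geojson", driver="GeoJSON")
```

This is the slow part at city scale: with more than 900,000 cells, each zonal pass touches every pixel of every parameter raster.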
With respect to this, there is some pre-processing: atmospheric spectral radiance, brightness temperature, and the calculation of the proportion of vegetation and emissivity, the combination of which gives us the overall land surface temperature of that particular area. It also gives us the vegetation in that area, and parameters like whether the surface is pervious or impervious, because Landsat provides different bands, each sensitive to a specific feature on the earth's surface.

I have already mentioned this, but let me highlight the major point. Zonal statistics is one of the main methods, and it is a time-consuming method, because cities like Calgary, or any bigger city, have a large number of grid cells. Since these are 100 by 100 meter grids, spanning them across the city gives more than 900,000 rows, and for all of these rows we need the statistics of each and every parameter present on the ground. So zonal statistics is one of the main methods used. For spatially merging the datasets, we use the geospatial Python libraries that can join features on the basis of geometry. Each bounding box we take has one geometry, and whatever falls into that geometry, or intersects it, is taken into that geometry's columns. In that way, we created an entire table whose rows and columns contain the dependent and independent variables.

These were the basic results; they will come up again while we skim through the code. If we do a normal ordinary least squares regression, we see that the results are not very good, but when we add the spatial fixed effect parameters, the communities and their boundaries, into the dataset, we see that the explainability of the model, the R squared value, improves compared to the other methods. The overall R squared value we got was around 0.71. The formula is like any normal regression formula, except that we take an extra parameter, a constant for each community: the land surface temperature of a grid cell is modelled as a weighted sum of its features plus a fixed constant for the community the cell belongs to.

So that is the main story of how we did it; now let me walk you through the code behind this particular model. Anand, can you check if you can see the coding screen, please? Yes, I can. I have put a link in the chat box to where we have kept the data. When you go into that folder, you will find a number of files used to process this dataset; you can download them and try it on your own. For the satellite data, we have added the links from which we downloaded it. It's a manual download process, and since satellite datasets are quite huge, we have not kept them in the respective folder; but independently of that one notebook, you can run the other parts using the data that is included. Step one of this modeling process was to calculate the land surface temperature and the satellite-derived indices, for which we use different geospatial libraries as well as standard Python libraries such as pandas, math, and json.
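To make the pre-processing chain concrete, here is a sketch of the commonly published Landsat 8 recipe (radiance, brightness temperature, vegetation proportion, emissivity, then LST), of the kind found in the USGS handbook material the speakers reference. The exact constants and thresholds here are standard textbook values, not taken from their repository:

```python
# Sketch of the Landsat 8 band-10 LST chain. ML, AL, K1, K2 come from the
# scene's MTL metadata file and vary per scene.
import numpy as np

def land_surface_temperature(band10_dn, ndvi, ML, AL, K1, K2):
    # 1. Digital numbers -> top-of-atmosphere spectral radiance
    radiance = ML * band10_dn + AL
    # 2. Radiance -> at-sensor brightness temperature, in Celsius
    bt = K2 / np.log(K1 / radiance + 1.0) - 273.15
    # 3. Proportion of vegetation from NDVI (0.2 / 0.5 are common thresholds)
    pv = np.clip(((ndvi - 0.2) / (0.5 - 0.2)) ** 2, 0.0, 1.0)
    # 4. Emissivity as a linear function of vegetation proportion
    emissivity = 0.004 * pv + 0.986
    # 5. Emissivity-corrected LST; 10.895e-6 m is the band-10 centre
    #    wavelength, 1.438e-2 m*K is h*c/sigma (Planck constants)
    return bt / (1.0 + (10.895e-6 * bt / 1.438e-2) * np.log(emissivity))
```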
What we generally do is work with data at a certain resolution; in spatial terms this is a zoom level, so we fixed a particular zoom level. Then we establish the boundary, as Anand showed in the earlier slide: the exact boundary to which we want to clip the satellite data. We just want the land surface temperature and the other parameters for this particular boundary.

We check how many TIF files we have downloaded, which you can download through the USGS site. Once you download a TIF, you will find it in this form: one TIF file has this number of bands. Now the main process is to clip all of those files to the area of interest, Calgary. The code goes path by path, clips the data, and saves it with a particular band name and a chosen nomenclature; anyone can decide their own nomenclature. We repeat the same process: we select a particular dataset, clip it, and give it a name according to the nomenclature. This is what the processing looks like: it goes to each TIF file and clips it. Once we have clipped, you can check which dates it has run for; here it has run for 2013, month 3, day 22, and so on.

One major requirement when computing the satellite-derived indices is the metadata of the file, which includes factors like radiance, reflectance, and sun angles. We get those from the file's metadata: we iterate over that file and grab the information that is necessary for us. We also attach a config file that tells us which bands are used for which processing step; you will find the config file in the dataset folder. Third, any satellite dataset has a coordinate reference system; we set the coordinate reference system to a local metric system.

Then comes the main process: we extract the parameters from the metadata and calculate the land surface temperature. All of these calculations are described in the slides; this is the step-by-step implementation of the mathematical formulas written there, so I will not go into detail. With this we calculate the NDVI, the MNDWI (water index), the vegetation fraction, albedo, emissivity, and finally the land surface temperature. Each of these files gets saved individually under its own name in the extracted data format; for example, a file might be named 2013, third month, 22, Calgary, NDVI.

Once we have all of these files saved, we want to convert them into the grid form we spoke about. This is the function that takes an individual file, converts it into a grid format, and saves it with respect to the zoom level we selected. Once we have created those grids, I will show you how they look. Given the grid information, we need to attach the demographic data: the waterways and population data, the community-wise datasets, and the community boundaries. So we again do the spatial join method we spoke about in the slides, applying it to the waterways dataset as well as the community datasets. For each date we then have the LST, NDVI, MNDWI, and all the other parameters, with waterways and communities attached.
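As a minimal sketch of the clipping step described above, here is how one downloaded Landsat GeoTIFF could be clipped to the Calgary boundary with rasterio and geopandas. The paths and boundary file are hypothetical stand-ins:

```python
# Sketch: clip a Landsat band GeoTIFF to an area-of-interest boundary.
import geopandas as gpd
import rasterio
from rasterio.mask import mask

boundary = gpd.read_file("calgary_boundary.geojson")

with rasterio.open("LC08_scene_B10.TIF") as src:
    # Reproject the boundary to the raster's CRS before masking
    geoms = boundary.to_crs(src.crs).geometry
    clipped, transform = mask(src, geoms, crop=True)
    profile = src.profile
    profile.update(height=clipped.shape[1], width=clipped.shape[2],
                   count=clipped.shape[0], transform=transform)

# Save under a date/area/band nomenclature, e.g. year_month_day_area_band
with rasterio.open("2013_03_22_calgary_B10.tif", "w", **profile) as dst:
    dst.write(clipped)
```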
We save these particular files as GeoJSON, because those are geospatially enabled files. Now, steps two and three are more about arranging those parameters in a sequential manner. These notebooks don't require much explanation; I'll spend time on them only if there is a specific question. Just to skim through: since we have the dataset in a horizontal form, we want to make it systematic and add the time variable into it. So the steps "add time to grids", "add attributes to grids", and "merge attributes to grids" are only about getting the data into systematic shape. These are sequential operations, one after the other; no major operation like spatial regression is applied here. These two or three notebooks get the data into this form: we have the geometry, the structure ID, the building height, and the other parameters included, such as the date that particular building was built, its area, its floor count, and so on.

After having the data in this form, we want to add additional parameters which are processed afterwards. Once we have the gridded information, there are still two or three features that need to be calculated. First, we wanted to see, in terms of temperature, whether highly agglomerated building areas have higher temperature, for which we calculated an index known as Shannon's entropy. We calculate it with respect to each grid cell: for each grid, we look at how the buildings group together and assign one entropy value. Shannon's entropy is a unitless quantity that generally varies from zero to one: values towards one indicate very highly compacted areas, and values near zero indicate sparsely compacted areas. In Calgary, we saw that only the downtown area has high agglomeration; most regions of Calgary are sparsely agglomerated, without a high number of houses in a small area. We also calculated building block coverage: how much of each grid cell's area is covered by buildings. If two or three buildings fall into one grid cell, the total building block coverage tells us how much of that land is covered. These are two indices we can only calculate once the grid is in place, because we want to calculate them for every grid cell (see the sketch after this section).

Now comes the last step, in which we consolidate all of these datasets together and run a spatial regression on them. Whatever steps we saved earlier, we read the same files again. I have also added these files directly into the data archive, so if anyone wants to run only the spatial regression on their datasets, they can follow this specific notebook; they don't have to compute the previous steps. Now, when we load the community data: in the first step, we had included only the boundary of the dataset, because we just wanted to clip the satellite data to it.
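Here is one plausible formulation of the per-grid Shannon entropy described above, treating the share of building footprint area in each cell's sub-regions as probabilities and normalizing into [0, 1]. The talk does not spell out the exact definition used, so this is a hedged sketch rather than their implementation:

```python
# Sketch: normalized Shannon entropy for a grid cell's building distribution.
import numpy as np

def shannon_entropy(area_shares):
    """area_shares: building-area proportions over sub-regions, summing to 1."""
    p = np.asarray(area_shares, dtype=float)
    p = p[p > 0]                   # 0 * log(0) is taken as 0
    if len(p) <= 1:
        return 0.0
    h = -(p * np.log(p)).sum()     # Shannon entropy in nats
    return h / np.log(len(p))      # normalize into [0, 1]

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # evenly spread -> 1.0
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # concentrated -> near 0
```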
Now we want to include what is happening inside the different communities, in order to capture the spatial autocorrelation between them. Here we are just reading the heads of the dataframes, checking what the columns look like for each timestep. Then we select the particular columns that will be used as variables in the regression formula: unlike the year or the geometry columns, we only keep the parameters that make sense to go into a regression. With those parameters selected, we fit the models: first k-nearest neighbours (to build the spatial weights), second ordinary least squares regression, third spatial regression, and fourth spatial regression with fixed effects. The spatial regression with fixed effects uses k-nearest neighbours because it needs to know how many nearby grid cells each cell should be correlated with. If you take a 3 by 3 block of grid cells and leave out the central cell, the remaining eight cells generally have an impact on the middle one, so we chose k accordingly.

Once we had applied this, we ran it across the years, from 2013 to 2021, and we wanted to visualize what it looks like. Since good satellite imagery was not available in 2013, we don't have much usable data for that year, but you can see the land surface temperature ranges for each year and how they change. Most of the time we see certain hotspots that are always there, like this particular middle part, which repeats many times. Once we have the land surface temperature, we view it community-wise: for 2013, what was the temperature for that particular community name or boundary area, and similarly for all the regions. 2013 didn't have much data, so we considered it from 2014 onwards. We merge it with the community dataset once again.

Now, this is the comparison of R squared that was also shown in the slides. What we found was that on average the model gave us an R squared of around 0.71, in terms of explaining the variability of the land surface temperature from the different indices and parameters we considered. In the formula I mentioned the community constant: each community has its own value. Again, we are not considering 2013, only 2014 to 2020. Each community comes with a particular temperature factor; we divide that by the number of years, and we get the community constant for that specific community, which looks something like this: for this particular community, from 2014 onwards, this was the derived community constant. This is just preprocessing and checking what those community constants look like. We also take the average constant: once we find the variables that are needed, we find the average constant for each variable, so that we can see, if the NDVI value varies in a particular manner, what average pattern it should follow in general. If required, we can also apply a RobustScaler for normalization, but we found that the results are similar without applying it; the notebook shows both applications, and you can run through it just to get a feel of it. Now, let's take one particular specific community and run the model on that.
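To make the modelling step concrete, here is a hedged sketch of k-nearest-neighbour spatial weights plus a regression with community fixed effects, using libpysal and spreg (the PySAL spatial regression package). Column names, the input file, and the use of a spatial lag model with community dummy intercepts are illustrative assumptions, one way to approximate the setup described, not the talk's exact code:

```python
# Sketch: KNN weights + spatial regression with community fixed effects.
import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.weights import KNN
import spreg

grid = gpd.read_file("calgary_grid_features.geojson")  # hypothetical file

# k=8: an interior cell of a 3x3 neighbourhood has eight neighbours
cells = grid.copy()
cells["geometry"] = cells.geometry.centroid   # KNN on cell centroids
w = KNN.from_dataframe(cells, k=8)
w.transform = "r"                             # row-standardize weights

y = grid[["lst_mean"]].values
features = ["ndvi", "mndwi", "albedo", "building_coverage", "entropy"]
# Community dummies give each community its own intercept (fixed effect)
dummies = pd.get_dummies(grid["community"], drop_first=True, dtype=float)
X = np.hstack([grid[features].values, dummies.values])

model = spreg.ML_Lag(y, X, w=w, name_y="lst_mean")  # spatial lag model
print(model.summary)
```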
So we took one particular community, Greenview Industrial Park. This is how it looks. We applied all the variable constants that were generated. What we see is that the actual land surface temperature for that year was 21.71, but the model predicted 24.17; this changes for each community, and this column shows the exact difference it was giving us. When we took the mean difference across all of these grid cells, it came to around 2.2 times 10 to the minus 10, essentially zero. So in general it was giving good results; the explainability, the R squared value, was 0.72. We saved those results with respect to each specific value again, so that we can visualize them on any platform. That's it; that's the main material. Does anyone have any questions?

Okay, thanks to the folks from Gramener for their very in-depth talk. A round of applause, please. We have time for some questions; if you want to ask one in person, please step up to the mic. Yes.

Hello, thank you very much for that talk; the level of detail was great, and I think it would be good as a tutorial as well. I'm just interested to know what the worst-case scenarios around the world are that you found with the data explorations, and whether this can be used for non-urban heat island effects. Can you still hear us on the remote? What's happening? Can we give that another go? Okay, let's try again if you can repeat exactly. Hello, can you hear me? It looks like everyone's gone. Anyway, just to repeat my question: what are the worst-case scenarios around the world that you found today, or that you predict in the next couple of years? And also, can this be used for non-urban environments?

Hi, can you hear me? I can hear you. Yes, just a small clarifying question: can you give an example of a non-urban environment? Do you mean with respect to climate models or something like that? Yeah, I'm just thinking about, say, non-urban developments around the world and a changing vegetative landscape, if that makes sense. Definitely. So this model can be replicated anywhere Landsat imagery is available. Landsat imagery is generally available over land during the daytime, so whichever area is covered by the Landsat platform around the globe can be included in this type of study. Although urban development happens at a faster rate than in non-urban areas, say vegetation areas or ecological areas, you might not see as much change in the land surface temperature there, because land surface temperature is essentially the skin temperature of the object. If there is vegetation, it will be the skin temperature of the vegetation; if there is water, it will be the skin temperature of the water; and in the same way, if it is a terrace or building surface, it will be the building surface temperature. So yes, to answer you, it can be applied to these sectors. Is there any part of the question I'm missing? Also, what are the worst-case scenarios around the world that you found today or expect in the coming years? One of the worst-case scenarios is soil loss, the soil losing its productivity, which feeds into reduced crop yields. That is one.
The second is the overall climate change that is being driven by urban morphological factors. And third is rapid deforestation, which might lead to this kind of event at a large scale, across different parts of a city at the same time. Those are the major ones. Okay, thank you.

Any other questions? Do we have any remote questions? No. Maybe I have one question: why did you choose Calgary? And there were some anomalous areas that showed up in the trends; were you able to explain those, and maybe make use of that information? Thank you. So, out of this work we created an application that does scenario modeling for areas like Calgary, regions in Canada which faced massive heat rises in some months, and that is what prompted this. What they are primarily doing is moving towards sustainable development. Once you know in which particular pocket you have what type of urban heat island, you can take action on that basis. So this gives actionable insights they can directly work on: they will know exactly which location, which grid cell, which geographic location, to work on to improve it. And as new satellite images come in, they can also monitor the positive or negative changes from whatever action has been taken on the ground. So, yes.

Is there a concrete example you can give us of something actionable for the city? So, one of the major things we got from this exercise was that building height was helping to reduce some of the urban heat island effects, because the height of a building throws shadows on the surrounding areas, depending on the sun angles. So for areas with massive, big land parcels, for example residential areas, having tall buildings rather than large, wide buildings can generally help the most. But these sections are still in progress; we have given them the application, and they are working on it, checking the different hotspots and preparing their climate teams. Okay, thanks. If there are no remote questions, then I think we can finish up, with a nice round of applause for our colleagues in India.