Before I start my talk, a small disclaimer: I've never given a talk before, this is my first, so please pardon any pauses between sentences, and apologies in advance. My name is Akka Singare. I'm a volunteer at DataKind Bangalore, and professionally I'm a senior software engineer at Inferred AI. DataKind Bangalore is an organization that brings data scientists together with high-social-change organizations to maximize their impact. Our process is as follows: first we explore the problem statement the organization is trying to solve, to understand their pain points; then we take the data, see what it offers and where it falls short, and prepare it; then we dive into the data and build prototypes, taking multiple approaches; finally we meet with the organization to refine the prototypes and deliver a final solution.

This year we collaborated with Pollinate Energy, a social business that brings life-changing products such as solar bulbs and water filters to millions living in poverty in India. Their main audience is urban makeshift communities, abbreviated as UMCs; I'll refer to them as UMCs henceforth. These communities build makeshift houses that look like this, with a blue tarpaulin surface over the roof. Pollinate's current process involves zooming into Google Maps at each location in Bangalore to find these communities. This is the aerial view: in the satellite image, the blue regions you see here are the UMCs. Now that we understood the pain point, DataKind wanted to use satellite images and geocoded data to identify the UMCs.
Now that we had set our goal, we started gathering and preparing the data. We received the locations of the UMCs that Pollinate Energy had mapped, which were just latitudes and longitudes. The image in the middle is a pictorial representation of the UMC data provided by Pollinate Energy; please note that it is for presentation purposes only and does not reflect the real-world data. The image on your left is a grid over the whole of Bangalore, with closely spaced red dots representing each location.

Now that we had the UMC data, we wanted to create data representing non-UMCs, which could be lakes, buildings, schools, anything that is not a UMC. So we took the Bangalore grid we had created, subtracted the locations that contained UMCs, and sampled the remainder as non-UMC data. We worked on Pollinate's claim that they had mapped all the UMCs, so for now we assumed every UMC was covered by this data set.

Once we had the data, we started exploring it. We found duplicates, and the data set was highly skewed: non-UMCs outnumbered UMCs many times over. Also, some of the latitude-longitude pairs provided by Pollinate Energy were not accurate enough, meaning that when we looked at the satellite image there was no UMC present in it; the UMC was probably somewhere in the vicinity of the image, but we excluded those images. In total we received around 469 community locations from Pollinate Energy; after cleaning, this reduced to 252 community locations.
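The grid-subtraction step above can be sketched in a few lines of Python. Everything here is illustrative: the bounding box, grid step, and UMC coordinates are made up, not the real Pollinate data.

```python
import random

# Hypothetical bounding box and grid step over Bangalore (made-up values)
LAT_MIN, LAT_MAX = 12.85, 13.10
LON_MIN, LON_MAX = 77.45, 77.75
STEP = 0.01  # roughly 1 km per cell

def to_cell(lat, lon):
    """Snap a lat/lon coordinate to its grid cell index."""
    return (round((lat - LAT_MIN) / STEP), round((lon - LON_MIN) / STEP))

# Every cell in the Bangalore grid
grid = {(i, j)
        for i in range(int((LAT_MAX - LAT_MIN) / STEP) + 1)
        for j in range(int((LON_MAX - LON_MIN) / STEP) + 1)}

# Known UMC locations (illustrative coordinates, not real data)
umc_points = [(12.97, 77.59), (12.93, 77.61), (13.01, 77.55)]
umc_cells = {to_cell(lat, lon) for lat, lon in umc_points}

# Non-UMC candidates: the grid minus cells containing UMCs, sampled down
non_umc_cells = sorted(grid - umc_cells)
random.seed(0)
sample = random.sample(non_umc_cells, 100)
```

This rests on the same assumption the team made, that the UMC list is complete; any unmapped UMC would leak into the non-UMC sample.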
Once the data was cleaned, we started prototyping, taking multiple approaches. One major approach used satellite images: a traditional image-processing technique of blob detection, where a UMC would appear as a blob; a supervised machine learning approach; and a transfer learning approach using convolutional neural networks. The other major approach used geocoded data, where we tried multiple sampling techniques along with different machine learning models to see what worked best.

First, the computer vision approach. To show you what the data looked like: the blue regions you see in three of these images are the UMCs from an aerial view. We also sampled around 2,000 images from the non-UMC data.

The first technique was blob detection using the HSV color space. When we took the RGB image and viewed it in HSV, we realized that the blue regions in RGB mapped to purplish blobs in HSV. So we thought we could use thresholding to segregate these potential UMCs, and applying threshold values resulted in the half-filled blobs you see here. To make these blobs more pronounced we used an image-processing technique called dilation, which spreads out the white pixels in the image. The result was more pronounced blobs, but also some blobs far too small to represent any UMC. So we used another technique called erosion, which spreads out the black pixels, so that we retained only the large blobs you see here. Using the coordinates of these blobs, we mapped them back onto the original image to show the presence of UMCs according to this approach. This approach gave an accuracy of 85%, which was very good, but its drawback is that it is not scalable.
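As a rough pure-Python sketch of the dilation and erosion steps just described (a toy 0/1 mask with a 4-neighbour structuring element, not the project's actual image-processing pipeline):

```python
def dilate(mask):
    """Spread white (1) pixels: a pixel becomes 1 if it or any 4-neighbour is 1."""
    h, w = len(mask), len(mask[0])
    return [[1 if any(mask[ny][nx]
                      for ny, nx in [(y, x), (y-1, x), (y+1, x), (y, x-1), (y, x+1)]
                      if 0 <= ny < h and 0 <= nx < w) else 0
             for x in range(w)] for y in range(h)]

def erode(mask):
    """Spread black (0) pixels: a pixel stays 1 only if it and every in-bounds 4-neighbour are 1."""
    h, w = len(mask), len(mask[0])
    return [[1 if all(mask[ny][nx]
                      for ny, nx in [(y, x), (y-1, x), (y+1, x), (y, x-1), (y, x+1)]
                      if 0 <= ny < h and 0 <= nx < w) else 0
             for x in range(w)] for y in range(h)]

# A half-filled blob, as after HSV thresholding
blob = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
filled = dilate(blob)    # dilation makes the blob solid

# An isolated speck far too small to represent a UMC
speck = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
cleaned = erode(speck)   # erosion removes it entirely
```

If the project used OpenCV, these two operations would correspond to `cv2.dilate` and `cv2.erode` applied to the thresholded HSV mask.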
Pollinate Energy caters not only to Bangalore but also to Chennai, Mumbai, and Kolkata, and in Kolkata especially, the urban makeshift communities do not have blue roofs. In that sense, this particular approach does not scale.

The next approach was supervised machine learning. We took the original images and manually created masks in which white pixels represent the presence of UMCs. We did this so that we could sample patches out of these images that represent UMCs and feed them to our supervised machine learning model. For this model we used features such as entropy, intensity histograms, Gabor filters, Haralick features, and local binary patterns, and fed them to a random forest classifier. The resulting image looks like this: the red rectangles are where the model says a UMC is present in that patch. This approach yielded an accuracy of 95.2%, which was very good, and both recall and precision were quite high at 0.95. This is one of the approaches that is scalable to Chennai, Mumbai, and Kolkata; its only drawback is that we have to manually mask out the UMC regions in the images.

The next approach was transfer learning. Transfer learning means you train an algorithm on one domain, images in this case, and apply the knowledge from that model to a similar domain, here specifically satellite images. The pre-trained model we used was Inception V3, and with it we achieved an accuracy of 67%. Although that accuracy seems low, we just wanted to show that deep learning has potential for this use case.
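The mask-driven patch labeling described above might look roughly like this. The window size, labeling threshold, and toy image are hypothetical; the real pipeline's feature extraction (Gabor, Haralick, LBP) and classifier are omitted.

```python
def sample_patches(image, mask, size=2, frac=0.5):
    """Slide a size x size window over the image; label a patch as UMC (1)
    if at least `frac` of the corresponding mask pixels are white (1)."""
    patches = []
    h, w = len(image), len(image[0])
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patch = [row[x:x + size] for row in image[y:y + size]]
            white = sum(mask[y + dy][x + dx]
                        for dy in range(size) for dx in range(size))
            label = 1 if white / (size * size) >= frac else 0
            patches.append((patch, label))
    return patches

# Toy 4x4 grayscale image and its hand-drawn mask (1 = UMC roof)
image = [[10, 12, 200, 210],
         [11, 13, 205, 202],
         [90, 95, 100, 101],
         [92, 97,  99, 103]]
mask  = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
data = sample_patches(image, mask)
labels = [lbl for _, lbl in data]
```

Each (patch, label) pair would then be turned into a texture-feature vector and fed to the random forest.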
The hyperparameters used for this training were the standard ones that have worked well for images, but we are hopeful that by fine-tuning them we can deliver better accuracy and a better solution to Pollinate Energy. With this, the computer vision approach concludes.

The next major approach was the data proxies approach. By data proxies I mean entities at a given location such as schools, hospitals, highways, or railway crossings. In this approach we had a total of 1,889 places containing both UMCs and non-UMCs; 13% of the data were UMCs and the rest were non-UMCs. The hope was to use the presence of these different proxies, such as schools and hospitals, and see whether they influence the presence or absence of a UMC. So we counted the different proxies present within 500 meters of a given location. We also observed that a large number of the data proxies were very sparse: out of 1,889 places, only 56 locations had a railway crossing in the vicinity, and only one location had an airport, while 1,657 places had a school nearby.

We started with an under-sampling approach: since this was an imbalanced data set, we under-sampled the majority class down to the level of the minority class, resulting in 252 UMC locations and 252 non-UMC locations. For preprocessing we took three approaches: standardizing the counts of data proxies fed to the classifier, normalizing the counts, and min-max scaling the counts.
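Counting proxies within 500 meters of a location can be sketched with a haversine distance check. The proxy coordinates below are invented for illustration.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def proxy_counts(location, proxies, radius_m=500):
    """Count each proxy type within radius_m of the location."""
    counts = {}
    lat, lon = location
    for kind, plat, plon in proxies:
        if haversine_m(lat, lon, plat, plon) <= radius_m:
            counts[kind] = counts.get(kind, 0) + 1
    return counts

# Illustrative proxies near a made-up location
proxies = [
    ("school",   12.9701, 77.5902),  # tens of metres away
    ("school",   12.9710, 77.5895),  # ~100 metres away
    ("hospital", 12.9800, 77.6100),  # well outside 500 m
]
counts = proxy_counts((12.9700, 77.5900), proxies)
```

One such count vector per location, over all proxy types, forms the feature row fed to the classifiers.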
In this approach the evaluation metric was accuracy, because now that we had balanced the data set, we thought accuracy would be the best metric. Having tried multiple models such as logistic regression, LDA, KNN, decision trees, and others, we found that a support vector machine along with standardization gave the best accuracy, 86%. The F1 score was also high, the true positives were high, and the false negatives were on the low end. So this approach gave really good results that are also scalable.

The next approach was stratified sampling: when you create a training and test set, you retain the class ratio of the original data set. In this approach we initially had around 97 data proxies. After exploring them we realized, as I said, that the proxies are very sparse, and that not all of them contribute to the presence or absence of UMCs. So we used two feature selection techniques: a chi-square test to select the 50 best features, and principal component analysis, where we retained the principal components explaining 95% of the variance. For preprocessing we took two approaches: standardizing the counts of data proxies, and normalizing them. Here the evaluation metric was not accuracy but recall: since the data set was imbalanced, we wanted to make sure we avoided identifying UMCs as non-UMCs. We tried multiple models such as KNN, support vector machines, gradient boosting, and random forests, but the best of them all was Naive Bayes, which gave a recall of 98%, which was very good. Unfortunately, the accuracy was on the low end at 31%, because a lot of non-UMCs were classified as UMCs.
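The stratified split idea can be sketched in pure Python (a minimal version of what `sklearn`'s stratified splitters do; the 13% class ratio mirrors the UMC data, but the labels here are synthetic):

```python
import random

def stratified_split(labels, test_frac=0.25, seed=0):
    """Split indices into train/test so both keep the original class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = int(round(len(idxs) * test_frac))
        test.extend(idxs[:k])
        train.extend(idxs[k:])
    return sorted(train), sorted(test)

# 13% positives, like the UMC data: 13 UMCs (1) and 87 non-UMCs (0)
labels = [1] * 13 + [0] * 87
train, test = stratified_split(labels)
test_pos = sum(labels[i] for i in test)  # positives land in the test set
```

Because each class is sampled separately, the test set keeps roughly the same 13% UMC share as the full data, which is exactly what plain random splitting cannot guarantee on a small minority class.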
But from Pollinate Energy's perspective, it was okay to misclassify a non-UMC as a UMC; it was not okay to misclassify a UMC as a non-UMC. All in all, data proxies is one of the approaches we recommended to Pollinate Energy, mainly because it can be calibrated to different cities, such as Chennai or Kolkata: we can use the different proxies there and see how best they influence the presence or absence of a UMC.

In closing: the major barrier we faced in this sprint, especially with respect to the data, was getting the right data, and enough of it, for our models. We see huge room for improvement, where we can fine-tune all our approaches and models to deliver a better solution to Pollinate. We also see high potential for impact in other sectors. Currently Pollinate focuses on selling solar bulbs, but there may be other organizations that deliver, say, education to these communities; in that sense there is potential for great impact. All in all, DataKind is an organization open to volunteers of different expertise and backgrounds, so if you feel like contributing, please visit our official website, follow our meetup group to see when we meet, or join our Slack group. I'd welcome any suggestions to improve the model, and in case I missed something, please feel free to point it out. I'll also take any questions, if there are any. Please stand up for your questions.

Yeah, can you please tell me what parameters differentiate UMC and non-UMC? And what technique are you following for mapping the geocoded data onto the images of UMCs and non-UMCs? You were saying that in Kolkata the blue roofs were not there.
So how are you locating those raw images, and then moving to cleaning, and then the modeling part? Sorry, I missed that. What are the parameters that differentiate non-UMC and UMC? How are you taking the raw data from the satellite images, and how are you predicting that something is a UMC or non-UMC before cleaning it?

The idea is that when you look at an image that contains a UMC, you'll realize that, especially in industrial areas of Bangalore, most factories also have a blue surface on the roof. But when you look at the entropy of a region, you'll see that a region with a UMC has higher entropy, meaning it is not a normal, smooth blue surface, whereas some other industrial roof will have low entropy. That is mainly how we segregate UMC and non-UMC.

Yeah, really interesting work. I had a few questions. One is: did you consider combining the approaches? You used your data proxies approach separately, and your CNN approach and the random forest approach on the images separately. Have you given any thought to combining these and seeing what could happen there?

Yes, we've actually thought about it. Right now we are at the stage where we have just created prototypes of the different approaches. Going forward, we look forward to mixing the approaches and seeing how best we can leverage their outputs.

The other question I had was around how you manually labeled. In your random forest, could you go to the slide with the random forest output? No, the next slide, I think. Yes, right there. Here you've marked out the segments where the model thinks there are UMCs, but in your CNN approach I didn't notice that you had done this.
So how do you know that the model got it right? Basically, we extracted patches out of the original image, using the masks to segregate the patches we would label as UMC or non-UMC. Then we stitched the outputs together to produce this final output image.

Hello. You are talking about classification, but this actually seems like a localization problem, and the major difficulty will come from localization. Suppose you get 90% accuracy in classification but only around 50% accuracy in localization; your actual accuracy will come down to 50-60%. So this is not really a classification problem; I think you have to concentrate more on localization: how to localize the exact area, and how to evaluate it. Did you use some area-based evaluation, or was it label-wise, where you simply label a region? How did you evaluate these things?

You mean to ask about the evaluation metrics we used, right? Yes. So, when I talked about the data proxies approach in particular, we generally looked at the ratio of UMCs and non-UMCs in the data set.

Yes, exactly, this is a two-class problem: you label the data and classify it. But when you actually deploy the model in a real scenario, most of the problems will come from segmentation and localization. If you use some fantastic classifier it will reach 95% or 99%, but in real life it is very hard to reach even 60% accuracy. So don't concentrate only on classification, random forest or whatever it is.
With any classifier, very sophisticated classifiers are available now: you just train on the data and you get a result. But concentrate most on how to localize these regions appropriately and how to label the data. That is my suggestion. Thank you. Let's meet offline so we can talk more about this. Yes, okay, thank you. Any more questions? Yeah.

Hi, I'm Ravi Khan. I just have one suggestion; I don't know if you've tried it or not. Did you also try looking at the day and night light patterns of those areas? There was some study, in the US I think, where they looked at the day and night light patterns of certain areas to plan how to deploy resources there. What generally happens is that most nearby areas will be brightly lit, because those will be airports or railways or buildings, urban areas basically, but these slum areas probably won't be as well lit. So that can maybe give you some kind of signal that this is a possible urban makeshift area. Just a suggestion. Thank you.

Thanks for the very informative talk. I got the idea that there must already be some research on this front. Since you are processing all these satellite images, and solar comes into the picture, why can't we use the same techniques to find the places where we could install solar panels and generate electricity? I'm not sure whether any research is going on at, I forgot the organization, DataKind, right? Yes, DataKind. So probably you can try that as well. Sure, sure, we can talk about this more offline too.

On a closing note, this is the team that made this possible. I'd like to thank DataKind and Anantil for this opportunity. Thank you.