Thanks, these are like the best MCs ever. Yeah, so, hi, my name is Grace and the title of my talk is Creating Map Visualizations with Open Data and Folium. So that was the title of the talk that I submitted to the PyCon organisers. Actually, I'm talking about Ruby, not just him. So this was the title, but as I was making the slides, the talk slowly morphed into a story about how I used open data to find myself an awesome home to live in in Singapore. Yeah, so before I launch into this long story, let me give you some background about myself. So I've been working at 99.co for like a year and a half now. And I look at these like amazing properties in there. Okay, so let me step back. 99.co is a property search portal. So for those of you who don't know what that means, it's basically Airbnb, but for much, much longer term rentals. Yeah, so I've been working at 99.co and I look at these properties every day, right? So one day I decided, hey, why don't I move into one of these amazing properties myself? And 99.co's app is pretty awesome. I can enter my search filters. I can enter my budget and the number of rooms I want and everything. But even after entering all these search parameters and filtering out the rental properties that don't meet my expectations, I'm still faced with literally thousands of listings to choose from. So now I have a problem, right? How do I choose the best possible listing in Singapore? So I thought to myself, you know, I don't actually have to guess. I can use data to solve my problem because Singapore has this amazing collection of open data. So as a starting point, because I and everyone else can't afford a car in Singapore, I decided that I really want to live close to an MRT station. So the next thing I wanted to find out was which MRT station was the most affordable without being too far away from places like work and play.
So to do this, I decided that I would create a map visualization, calculating the average price per square foot around each MRT station. Furthermore, I would color code the prices so that the visualization would make it very easy to digest all this information at a glance. Okay, so the first step was to map the MRT stations on Singapore itself. So to do this, I got the open data from MyTransport.sg. And MyTransport.sg has this amazing collection of transport-related geolocation data. So they have everything from the arrow markings to the road crossings. They even have the locations of those curved mirrors you see at blind spots. Yeah, and for our purposes, I needed the train station data. So when you download this data, it's in the form of an Esri shapefile. So Esri is the Environmental Systems Research Institute. And they define the file format for this kind of geolocation data. So the geolocation data that you download has two files, the shape file, which ends in .shp, and the data file, which ends in .dbf. So the shape file contains the actual location data, whereas the dbf file contains the attributes of this location data. So both of these files can be opened with the shapefile library from pyshp, P-Y-S-H-P. So what you do is you open both of the files, and then you feed them into the reader, and then you can extract the records. So as a first step, just as a sanity check, I wanted to see how many records there were in the shapefile. So I just printed the length of the records, and we see there are 101 shape objects in this file, which roughly corresponds to the number of MRT stations you would expect in Singapore. So this is a good first step. Next, I wanted to see what kind of geolocation information we have in the shapefile. So Esri shapefiles can contain a number of different types of geolocation data. So we could have a point in each record. We could have a polygon. We can also have 3D points, including the altitude.
We can have 3D polygons. So yeah, I wanted to find out which one of these types was in our file. Yeah, so we printed the type of the first record, and we see it's type 1. So all we have is a point. So now we know we have 101 points in our file. Okay, and next I wanted to find out what kind of information we have about these points. So I printed the fields in the shapefile object. And for Esri shapefiles, this always contains something called a deletion flag, and this is something to do with the way the records are deleted from the file. So yeah, we'll just ignore that. Instead, we'll look at the other item in the list. So the first value in this list is something called STN NAM. And the thing about Esri shapefiles is that the name of the attribute can only be a maximum of 10 characters long. So most of the time it's abbreviated to something shorter. And looking at this, you can kind of guess STN NAM sounds like station name, right? We're looking at train station data after all. Okay, the next value, C, just tells you what kind of field this is. So it could be a character, a number, or a date. Yeah, so in this case, C stands for character. The third value, 250, just states the maximum size of the field. So in this case, we know now that we have a character string that is at most 250 characters long. And finally, the fourth value has something to do with the number of decimal places that are allowed in the field. But this is not a number, so it's not applicable. So we just ignore that as well. Okay, so now that we know a little bit about the fields, right? We can iterate through the records to see what kind of information we have. So in this example, the first thing I'm printing is the attribute data about the location. And then the second thing is the location data itself. So I iterate through the first three records. And we see that we get stuff like Kranji, One North, and Kallang. So this kind of confirms that STN NAM stood for station name.
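The field descriptor walked through above can be decoded mechanically. The following is a minimal sketch using only the standard library; the `fields` list is a hypothetical stand-in shaped like the four-element descriptors the talk describes (with pyshp, these would come from the reader's `fields` attribute and the type code from a shape's `shapeType`). The field name `STN_NAM` is an assumed spelling of what was read out on stage.

```python
# Shape type codes from the Esri shapefile specification (a subset).
SHAPE_TYPES = {
    0: "Null", 1: "Point", 3: "PolyLine", 5: "Polygon",
    8: "MultiPoint", 11: "PointZ", 13: "PolyLineZ", 15: "PolygonZ",
}

# dBASE field type codes used in the .dbf attribute file.
FIELD_TYPES = {"C": "character", "N": "number", "F": "float",
               "L": "logical", "D": "date"}


def describe_field(field):
    """Turn a (name, type, size, decimals) descriptor into a sentence."""
    name, type_code, size, decimals = field
    kind = FIELD_TYPES.get(type_code, "unknown")
    text = f"{name}: {kind}, up to {size} characters"
    # Decimal places only make sense for numeric fields.
    if type_code in ("N", "F"):
        text += f", {decimals} decimal places"
    return text


# Hypothetical descriptors mirroring the ones described in the talk.
fields = [("DeletionFlag", "C", 1, 0), ("STN_NAM", "C", 250, 0)]

# Skip the deletion flag, as the talk does, and describe the rest.
for field in fields[1:]:
    print(describe_field(field))

# Shape type 1, which is what the first record reported.
print(SHAPE_TYPES[1])
```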
And then when you look at the location data, right? For those of you who are familiar with the location of Singapore in latitude-longitude coordinates, this looks a little strange because the center of Singapore is actually 1.35 degrees north and 103.8 degrees east. So what are these crazy numbers in the tens of thousands? So it turns out that the Earth is not exactly flat, like someone mentioned yesterday. I mean, the Earth is not a sphere, it's not a perfect sphere. It actually has like crazy bumps and dips on the surface and everything. And combine this with the fact that the tectonic plates are always in motion. This means global positioning systems that were designed for the entire globe are actually pretty inaccurate for smaller regions. And you don't really get much smaller than Singapore, right? So it's actually quite bad for Singapore. So what Singapore did was they came up with their own positioning system called SVY21. Yeah, and SVY21, okay. So if you're dealing with latitude-longitude coordinates, right? If you travel all the way from the west to the east, you're only traversing like 0.5 of a degree, which is quite sad. So if you move a few hundred meters, you're dealing with tiny fractions of a degree. Whereas in SVY21, crossing the island means moving a whole 50,000 units. So when you move a little bit across the island from Tuas to Changi Bay, you're dealing with more reasonable units, like hundreds and thousands rather than like 0.0001. Okay, so this presents a problem, right? Because a lot of the geo packages out there accept latitude-longitude coordinates only. So in order to work with these packages, we need to first convert the SVY21 coordinates into lat-long values. Fortunately, someone has written an SVY21 package that does exactly this. So we just import the SVY21 package and we use the compute-lat-long method. And yeah, we've converted our SVY21 format data into lat-long values.
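To make the conversion step less of a black box, here is a rough standard-library sketch of what such a conversion does. It is not the SVY21 package used in the talk. It assumes SVY21 is a transverse Mercator projection on the WGS84 ellipsoid with the parameters usually quoted for EPSG:3414 (origin 1°22'N 103°50'E, false northing 38744.572, false easting 28001.642, scale factor 1), and it uses the textbook series inverse. The function name `svy21_to_latlon` is made up for this example.

```python
import math

# Assumed SVY21 / EPSG:3414 parameters on the WGS84 ellipsoid.
A = 6378137.0                       # semi-major axis in metres
F = 1.0 / 298.257223563             # flattening
E2 = 2 * F - F * F                  # first eccentricity squared
ORIGIN_LAT = math.radians(1 + 22 / 60)
ORIGIN_LON = math.radians(103 + 50 / 60)
FALSE_N = 38744.572
FALSE_E = 28001.642
K0 = 1.0                            # scale factor at the central meridian

# Coefficient of latitude in the meridian arc series.
_ARC0 = 1 - E2 / 4 - 3 * E2**2 / 64 - 5 * E2**3 / 256


def _meridian_arc(lat):
    """Distance along the meridian from the equator to lat (radians)."""
    return A * (_ARC0 * lat
                - (3 * E2 / 8 + 3 * E2**2 / 32 + 45 * E2**3 / 1024) * math.sin(2 * lat)
                + (15 * E2**2 / 256 + 45 * E2**3 / 1024) * math.sin(4 * lat)
                - (35 * E2**3 / 3072) * math.sin(6 * lat))


def svy21_to_latlon(northing, easting):
    """Inverse transverse Mercator: (N, E) in metres to (lat, lon) in degrees."""
    m = _meridian_arc(ORIGIN_LAT) + (northing - FALSE_N) / K0
    mu = m / (A * _ARC0)
    e1 = (1 - math.sqrt(1 - E2)) / (1 + math.sqrt(1 - E2))
    # Footpoint latitude: the latitude on the central meridian with this northing.
    lat1 = (mu + (3 * e1 / 2 - 27 * e1**3 / 32) * math.sin(2 * mu)
            + (21 * e1**2 / 16 - 55 * e1**4 / 32) * math.sin(4 * mu)
            + (151 * e1**3 / 96) * math.sin(6 * mu)
            + (1097 * e1**4 / 512) * math.sin(8 * mu))
    ep2 = E2 / (1 - E2)
    c1 = ep2 * math.cos(lat1) ** 2
    t1 = math.tan(lat1) ** 2
    s2 = math.sin(lat1) ** 2
    n1 = A / math.sqrt(1 - E2 * s2)
    r1 = A * (1 - E2) / (1 - E2 * s2) ** 1.5
    d = (easting - FALSE_E) / (n1 * K0)
    lat = lat1 - (n1 * math.tan(lat1) / r1) * (
        d**2 / 2
        - (5 + 3 * t1 + 10 * c1 - 4 * c1**2 - 9 * ep2) * d**4 / 24
        + (61 + 90 * t1 + 298 * c1 + 45 * t1**2 - 252 * ep2 - 3 * c1**2) * d**6 / 720)
    lon = ORIGIN_LON + (
        d - (1 + 2 * t1 + c1) * d**3 / 6
        + (5 - 2 * c1 + 28 * t1 - 3 * c1**2 + 8 * ep2 + 24 * t1**2) * d**5 / 120
    ) / math.cos(lat1)
    return math.degrees(lat), math.degrees(lon)


# The projection origin should map back to roughly 1.3667 N, 103.8333 E.
print(svy21_to_latlon(FALSE_N, FALSE_E))
```

For real work, a maintained projection library is a safer choice than hand-rolled series like this one.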
So this looks a lot more like the lat-long values that we're familiar with for Singapore. Okay, so now that we've converted our geolocation data to lat-long values, we can finally start to make some maps. Hey, so for my first attempt, I used this toolkit called Basemap. And for those of you who are familiar with Matplotlib, Basemap is actually a toolkit that's contained within Matplotlib, which is like a whole collection of libraries. So I'm not going to go into too much detail here, except to say that Basemap, being a Matplotlib toolkit, gives you very fine control over what you plot. So you can control the lines that you plot, the fill color and so on. So in this example here, all I'm doing is I'm plotting the outline of Singapore and I'm filling it with a gray background. So yeah, we've got Singapore. You can also add Matplotlib-style markers. So here I'm adding a blue circular marker of size 12 at this coordinate. And it shows up kind of in the south of the island there. And you can also add text. So here I'm just labeling the point as Orchard MRT. So yeah, it's great. We have a map. And this is all well and good until you see what our JavaScript friends can do. So JavaScript has this library called Leaflet. And Leaflet is amazing. So this is what you can do in Leaflet. You can create these very rich interactive maps where you can drag, you can zoom, you can create heat maps like this. You can add pop-ups to markers. You can add HTML and CSS elements into these markers. It's really, really amazing. So suddenly our Basemap maps, which are static and non-interactive, are really dissatisfying. But thankfully, there's one rule of the internet. If it exists, there's a Python package for it. And that Python package is Folium. So Folium to the rescue, you import Folium and in one line, you can map Singapore immediately. So compare that to the 13 lines it took in Basemap to create that gray blob.
In one line in Folium, you can create this rich interactive map of Singapore. Yeah, so what I'm doing now is I'm going to iterate through the MRT records and add a marker for each record. So the first step was to compute the lat-long values from the SVY21 values. And then I use the regular polygon marker method in Folium to add one marker for each MRT station. So I'm going to fill each marker with a dark gray color. I'm going to add four sides to each marker and I'm going to give it a radius of five. So this is where you get a dark gray diamond at each MRT location in Singapore. Furthermore, I'm going to use the station name data that we extracted from the DBF file and add it as a pop-up to the map. So yeah, now when you click on markers, you get the station name. Yeah, so now that we've plotted the MRT stations in Singapore, the next step is to compute the average price per square foot around each MRT station. And furthermore, I'm going to color code these according to price so that red stations will be more expensive and blue stations will be less expensive. So it's kind of like a heat map going from the cool colors to the warmer colors. Okay, so to compute the average price per square foot around each MRT station, I got transaction data from URA, which is the Urban Redevelopment Authority, and from HDB for the HDB stuff. So URA takes care of the private transactions and HDB maintains a separate collection of their own transactions. Okay, so when we look at the transaction data, we see we run into a problem. We have the address information but URA and HDB don't actually provide the coordinates of each address. So how do we extract the coordinates from this address information? So thankfully, there's an API, OneMap, which provides an address search endpoint so you can submit the address and it will return you coordinates.
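Going back to the marker loop described above, a sketch of it might look like the following. It assumes a recent Folium release where `folium.Map` and `folium.RegularPolygonMarker` are available; the version used in the talk may have exposed this differently. The helper names `marker_specs` and `build_map`, the exact gray hex value, and the station coordinates are illustrative, not taken from the talk. Folium is imported inside `build_map` so that the data-preparation half runs without it installed.

```python
def marker_specs(stations):
    """Build one marker description per (name, lat, lon) station."""
    return [
        {
            "location": [lat, lon],
            "popup": name,            # shown when the marker is clicked
            "number_of_sides": 4,     # four sides gives a diamond
            "radius": 5,
            "fill_color": "#444444",  # dark gray, an assumed hex value
        }
        for name, lat, lon in stations
    ]


def build_map(specs, center=(1.35, 103.8), zoom_start=11):
    """Render the marker descriptions onto an interactive Folium map."""
    import folium  # third-party; only needed for the rendering step

    singapore = folium.Map(location=list(center), zoom_start=zoom_start)
    for spec in specs:
        folium.RegularPolygonMarker(**spec).add_to(singapore)
    return singapore


# Approximate, illustrative coordinates for three stations.
stations = [
    ("Kranji", 1.4251, 103.7620),
    ("One North", 1.2996, 103.7873),
    ("Kallang", 1.3114, 103.8714),
]
specs = marker_specs(stations)
print(len(specs), specs[0]["popup"])
```

Calling `build_map(specs).save("mrt.html")` should then write a standalone page that can be opened in a browser.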
So in this case, I submitted, you know, 71 IRRAC Crescent and this endpoint returns me the coordinates in SVY21 format, but we've already seen how we can convert that to lat-long. Okay, so great. We've converted both our address information and our MRT information into lat-long coordinates. So now we can use this library called geopy to compute the distances between each address and each MRT station. For this, I use a method called great circle, which is a method to compute the distance between two points on a sphere. So some people have asked me, why can't you use Euclidean distance to compute the distance since Singapore is really small and the earth is essentially flat at that scale? But in addition to not assuming that the earth is flat, great circle is also very convenient for converting these distances into meters, or if you're not a fan of metric, you can use yards. But yeah, this is great because I wanted to focus on units that were close to the MRT station. So I basically discarded all the transactions that took place more than a kilometer away from the MRTs. Yeah, and that's just the code. Yeah, so what I'm gonna do now is I'm going to iterate through the MRTs again and I'm gonna skip ahead a few steps because I wanted to focus on the mapping part of this, right? So what I've done for each MRT station is I've pre-computed the average price per square foot for all the transactions in the past three months within one kilometer of the MRT. In addition, I've normalized these average PPSFs and what this means is that I've basically converted the price per square foot to a zero to one scale. Yeah, and I'll elaborate on why I did this in a while. Okay, so some MRT stations are not going to have any information associated with them. For example, Changi Airport MRT has no units within a kilometer of the MRT station. So for those, I just set the value to be zero and then I color coded those markers as white.
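The filter-and-average step just described can be sketched with the standard library alone. The talk used geopy's great circle distance; the haversine formula below is a stand-in that computes the same kind of spherical distance, with an assumed mean Earth radius of 6,371 km. The function names and the sample transactions are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an assumed round figure


def great_circle_m(a, b):
    """Haversine distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def average_ppsf_near(station, transactions, max_m=1000.0):
    """Average price per square foot of transactions within max_m of a station.

    transactions is an iterable of (lat, lon, ppsf) tuples. Stations with
    no nearby transactions get 0, matching the convention in the talk.
    """
    nearby = [ppsf for lat, lon, ppsf in transactions
              if great_circle_m(station, (lat, lon)) <= max_m]
    return sum(nearby) / len(nearby) if nearby else 0.0


# Made-up station and transactions, purely for illustration.
station = (1.30, 103.80)
transactions = [
    (1.301, 103.80, 4.0),   # roughly 110 m away, kept
    (1.305, 103.80, 2.0),   # roughly 560 m away, kept
    (1.320, 103.80, 10.0),  # roughly 2.2 km away, discarded
]
print(average_ppsf_near(station, transactions))
```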
For the MRT stations with information, I used the weight values, the zero to one scale, to calculate a hue. So if I submit this zero to one value to the hsv_to_rgb method in the colorsys library, what this does is it converts the zero to one value into basically an RGB code. So yeah, it's basically a collection of three numbers, each number corresponding to the red, the green and the blue of the color. And then I converted this into a hex color code, which is what the Folium library accepts. Yeah, so basically you get a map like this. And the last step was to add the actual price per square foot to the pop-up message itself. So now when you click on the map, each marker will tell you the station name as well as the price per square foot associated with it. So yeah, okay, so this is basically like a radial rainbow, right? That starts from the central business district in the south here. You see for those stations in the south, they're really expensive. They're like $6 per square foot roughly. And then basically it gets cheaper and cheaper as you radiate out towards the fringes of Singapore where it's more like $2 per square foot. And this is not exactly super surprising. What was surprising to me though was how much the prices dropped off from the CBD. So you know, if you move from the downtown station, just one or two MRT stations out, the price already drops pretty drastically to like $4 per square foot. And what this means is if you care about living close to the CBD, right? The optimal thing to do is to move one or two stations away from the CBD because the moment you exit the CBD, the prices drop super drastically and then they don't really drop very significantly until you reach the extremes of the island, right? Yeah, so if you zoom in to this region, you can begin to see how you can make the locally optimal choice.
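The color coding described at the start of this part can be sketched as follows. The talk does not spell out how the prices were normalized or how the weight was mapped to a hue, so two assumptions are made here: min-max scaling over the stations that have data, and a hue that runs from blue (cheapest) at two thirds to red (most expensive) at zero. Only `colorsys.hsv_to_rgb` is taken from the talk; the function names and prices are illustrative.

```python
import colorsys


def weight_to_hex(weight):
    """Map a 0..1 weight to a hex color, blue for 0 through to red for 1."""
    hue = (1.0 - weight) * (2.0 / 3.0)  # assumed mapping: 2/3 is blue, 0 is red
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


def station_colors(avg_ppsf):
    """Color each station by its min-max normalized average price.

    avg_ppsf maps station name to average price per square foot, with 0
    meaning "no transactions nearby"; those stations are colored white.
    """
    known = [p for p in avg_ppsf.values() if p > 0]
    if not known:
        return {name: "#ffffff" for name in avg_ppsf}
    lo, hi = min(known), max(known)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all prices match
    return {
        name: weight_to_hex((p - lo) / span) if p > 0 else "#ffffff"
        for name, p in avg_ppsf.items()
    }


# Hypothetical averages in dollars per square foot.
print(station_colors({"Downtown": 6.0, "Fringe": 2.0, "Changi Airport": 0}))
```

Normalizing first is what makes the hue calculation straightforward: `hsv_to_rgb` expects its inputs on a zero to one scale.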
So for example, if Tiong Bahru has an average of like $4 per square foot, and say you find a listing that is within a kilometer of Tiong Bahru, and you see it has a price of $3 per square foot, then you can kind of get a sense that maybe you've stumbled across a deal. Yeah, so Martin was joking that, you know, there's tons of machine learning talks at PyCon this year, so I couldn't help but mention a bit of machine learning as well. At 99, we computed a model that basically finds the features that are most important in determining the value of a property. So besides the proximity to the CBD, the next most important location-related feature is the proximity to the MRT station itself, the nearest MRT. So the most satisfying part of this side project for me was that I managed to find a listing that was like one minute away from the MRT, which kind of implies that, you know, it should be more expensive than average, right? But it was actually cheaper than the average according to this map. So I'm happy to report that I was actually able to make a data-driven decision about where to live in Singapore. Yeah, thanks to open data and Folium. Yeah, so I actually want to show you a bit of the map and some of the code in a Jupyter notebook after this, but before I exit the PowerPoint, I very quickly wanted to thank Martin and Fiona because they basically encouraged me to give this talk despite some hesitance in the beginning. Yeah, and you know, Martin is like the most underappreciated hero advocate of women coders in Singapore. So thank you, Martin. Yeah, I also wanted to thank the PyLadies community because they've been super supportive as well. Yeah, and this is my GitHub page. Yeah, my GitHub name is Miri66. So the map and the Python notebook containing some of the code are on this page. Yeah, so now I'm gonna switch over to the browser.
Sorry, I don't like switching between PowerPoint and browser because I find that pretty disruptive. But anyway, this is the map. Yeah, so like I said, Folium maps are pretty awesome. You can drag, you can zoom. If there's any MRT you want to check out, let me know now. No? No one. Okay, so actually, I found this quite interesting. There are actually some pockets of value here and there. For example, Caldecott, right? You can see it's blue, whereas everything else around it is kind of green or cyan. So I don't know why it's cheaper there, but it's like $2.53 versus everything around it, which is like $3. So that's pretty cool. You can start to find good deals around Singapore. Yeah, this is quite sad. Like the lawyer, Tanjong Pagar, like $6. Yeah, so that's it for the map. I wanted to also show you a bit of the code. So this is running on a Jupyter notebook. So this is basically exactly what I presented just now, iterating through the records, blah, blah, blah. It only gets quite exciting when you get to the Folium part because it's more visual, right? So this is the part where I first plotted the map of Singapore. And then you can change the coordinates to re-center your map. So I just re-centered it to NUS. If we add stuff like 41 or whatever, it should go to, well, that was just a random number. I have no idea where we are now. Maybe we can zoom out. Nope, that's a bad idea. Okay, we're actually somewhere in Russia or something. Yeah, okay, nevermind. Yeah, so you can also set the zoom, the default zoom that you want to start at. And another cool thing about Folium is that it pre-includes a few tile sets. So there's one called Stamen Toner. If you don't like roads and buildings in your map for some reason, you can, oh nevermind. That doesn't remove the roads, but it changes the map to black and white. So there's a number of these pre-built tile sets that you can use in Folium.
Yeah, so this is the part where I added the MRT stations. Yeah, so you can do a lot with these markers. You can change the color. So say I just want to make it blue. It takes a while to render because I'm iterating through 101 MRTs. So yeah, you can't really see that it's blue. So let's blow it up by changing the radius. Yeah, so now you can see that it's blue, right? Because it's huge now. Yeah, and you can also change the number of sides. So say I'm sick of diamonds. I want a pentagon. Yeah, and you know, just now I plotted the rent values as circles, right? For those of you who are observant enough, I actually cheated. Those weren't circles. They were actually 30-sided polygons. Yeah, so they look kind of circular. Can't tell, right? I wonder, if you blow it up, can you see that they're not circles? I mean, that they're polygons. No, they still look like really huge circles. Yeah, okay, so that's pretty much it for the Jupyter Notebook. I also wanted to plug the package that I wrote. You know the model that I mentioned? You can actually do a lot more complicated things with open data in Singapore. So what I did was I took the URA data and the HDB data and I put it through this algorithm that's becoming very popular around the data science community, which is XGBoost. Some of you are nodding, right? It's super popular now. It's winning like every Kaggle competition. Yeah, so what I did was I wrote this package that makes it slightly easier to use XGBoost. So it takes care of some of the pre-processing steps for you. Basically, all you need to do is construct your pandas data frame, do some feature engineering, and then you dump it into this package and it will generate all the results for you automatically. So it does all the things like one-hot encoding, dropping the features with no variation and so on automatically for you.
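The two pre-processing steps named here, one-hot encoding and dropping features with no variation, are easy to illustrate in plain Python. This is not the package from the talk, whose interface is not shown; it is a small sketch on a list of dictionaries, with made-up rows and hypothetical function names. With a pandas data frame one would more likely reach for `pandas.get_dummies`.

```python
def drop_constant_features(rows):
    """Remove columns whose value is identical in every row."""
    keep = [key for key in rows[0] if len({row[key] for row in rows}) > 1]
    return [{key: row[key] for key in keep} for row in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up transactions: "country" never varies, "type" is categorical.
rows = [
    {"type": "HDB", "rooms": 3, "country": "SG"},
    {"type": "condo", "rooms": 2, "country": "SG"},
]
prepared = one_hot(drop_constant_features(rows), "type")
print(prepared)
```

A constant column carries no information a tree model can split on, which is why dropping it before training is a common default.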
Yeah, so basically what this model does is it predicts the value of each property and it also tells you the importance of each feature that you submit to it. Yeah, and if you wanted to do some more advanced stuff, you can too. You can sub-sample the transactions so you can train your model faster and iterate faster. And you can also do ensemble prediction, so you can split your data into a bunch of sub-samples, train multiple models based on each of these samples, and then, you know, get an average of the predictions, and this has been shown to yield more accurate results. Yeah, so that's basically it. I'll take questions with Danny. Could you look at a walkability score? Oh, no, but that's a very, very interesting, yeah, that's a very interesting point. Let me see that asset away. We're actually calculating walkability. Three data sets away. What three data sets? Walking distance, road size, crime, I think. That's a three thing that you then put it in your model and then you can control very interesting results. Okay, yeah, yeah. In Singapore, I'm sure crime is just uniformly low everywhere. Maybe like Desker or something like that. Oh, that in Malaysia, right? Ooh, god. Let's not go there. But yeah, walkability score is, I think, a very, very good idea. It's related to this whole MRT thing, right? Because basically no one can afford cars in Singapore, so I think walkability and transport are super important and I do agree that there's so much data we can use to compute a walkability score because MyTransport also has cycling paths and walking paths and stuff. So yeah, definitely go into that. Question, no question? Yeah, no question. Thanks a lot, Supervisor. That way to say. Thanks, thanks a lot for me. Is there a way to split a player combined by features of the unit, for example, number of rooms, number of players? Oh, yeah, yeah, yeah.
So, yeah, that's a really good point and I'm sure there's a lot of breakdowns we can do, right? Besides the number of rooms, which is an extremely important determinant of PPSF, some people have also suggested that I do a breakdown of HDB versus condo, because you're totally right. There's so many features that go into predicting the price. I really should do a breakdown of this at some point. Anyone else? Yeah, okay, thank you. Absolutely. It's a really wonderful talk to HDB. Thanks a lot. Covering for you. Playing with it for a while. It's really quite an interesting, exciting package to play with. There's obviously a fascinating project that you're working on. Is this something like hit up somebody with all the stuff you're doing that you're working on as well together? There are a lot of applications, actually. Yeah, so this part, this is on an IPython, yeah, this is on the GitHub, but the rest of it isn't. I could add it at some point. Yeah. I'm working on the print text side, so also you see there's a lot of APIs along with some of them. Obviously, a lot of other things that you can play with. Okay, yeah, cool, cool. Yeah, I will open it up. Thank you. Thanks. Okay, that's another question. I'll just give it a second. Yes, please make this available. Yeah, okay, we'll do it. Okay, thanks.