Hello everyone. My name is Haoyi, and this talk is about exploring Singapore's open data APIs.

A bit about myself: I was previously a software engineer at Dropbox working on internal tools. Currently I'm at Bright Technology Services, a small software and data science consultancy in Singapore. So if any of you have software projects that need extra hands, or data analysis or insights that need to happen, you can come look for us afterwards and we'll see if we can take on your project. That's my Twitter handle and GitHub, and I have a blog at lihaoyi.com. So that's enough about myself.

This talk was originally a post on my blog, and it's about open data. Open data is the idea that data is publicly available for anyone to use with no strings attached: no contract that needs to be signed, no agreement or partnership that needs to happen. I assume most people here are doing some sort of data science job. So who here is doing some kind of data science job? Who here makes use of publicly available APIs as part of their job?

Part of what's nice about open data is that you can use it immediately, without first needing a meeting to, for example, work out an agreement between you and the other company. That lets the data be used by many more people: students, individuals, small startups, who would not be able to sign a contract with a large multinational or with the Singapore government. I'm not affiliated with any of the organisations whose open data I'm showing; I'm just an enthusiast who happens to have used some of it.

The Singapore government has a bunch of open data providers, and they are somewhat disparate. You have Data.gov.sg, and you have LTA, which does public transport and other transport data.
You have things like the library board, the NEA giving you the PSI for how much haze there is, and things like the URA data service telling you what prices condos are selling for in each area. There isn't any real consolidated list of open data providers in Singapore, and this list is itself not complete. But from this there are already a lot of interesting things you can explore and do with the data.

It covers a whole range. At one end are not very interesting reports that get published once a year: open data that's public, but not really something that's interesting to a developer or programmer. There are small data sets, like one that comes as an Excel spreadsheet telling you how many people are in each ministry in Singapore every year. Again, not particularly interesting.

Then you have live data, which starts getting more interesting. For example, if you want your app to make use of the current Singapore weather, the NEA (National Environment Agency) has an open API you can use to query the weather forecast for the next 4 hours, the PM2.5 readings, that sort of thing. And you have LTA exposing the list of every arrow marking on every road in Singapore, or every convex mirror in Singapore: those convex mirrors they put around corners so you don't crash into cars coming the other way. They have this huge data set of every single one, and if you need it, you can download it.

In addition, they also have the real-time APIs, which I personally think are the most interesting. For example, you can query the bus arrival times for any bus at any bus stop in Singapore: at a given moment, when is the next bus going to arrive at this bus stop? That's the data source that apps like bus.sg or MyTransport.sg use.
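To make the bus arrival query a bit more concrete, here is a minimal Python sketch of turning one such response into "minutes from now" per service. The response shape (`Services`, `ServiceNo`, `NextBus`, `EstimatedArrival`) and the ISO 8601 timestamp format are assumptions based on how the API is described in this talk, not something checked against the LTA documentation, and `sample` is made-up data standing in for a real network response.

```python
from datetime import datetime, timezone


def minutes_until(timestamp, now):
    """Whole minutes from `now` to an ISO 8601 timestamp, or None if it is missing or broken."""
    if not timestamp:
        return None
    try:
        arrival = datetime.fromisoformat(timestamp)
    except ValueError:
        return None
    if arrival.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        arrival = arrival.replace(tzinfo=timezone.utc)
    return int((arrival - now).total_seconds() // 60)


def next_arrivals(response, now):
    """Map each service number to the minutes until its next bus (None if unknown)."""
    timings = {}
    for service in response.get("Services", []):
        next_bus = service.get("NextBus") or {}
        timings[service["ServiceNo"]] = minutes_until(next_bus.get("EstimatedArrival"), now)
    return timings


# Hypothetical response for one bus stop; a real one would come from the API.
sample = {
    "BusStopID": "09022",
    "Services": [
        {"ServiceNo": "65", "NextBus": {"EstimatedArrival": "2016-05-01T10:00:30+00:00"}},
        {"ServiceNo": "106", "NextBus": {"EstimatedArrival": "2016-05-01T10:11:00+00:00"}},
        {"ServiceNo": "NR1", "NextBus": {"EstimatedArrival": ""}},  # not running: no data
    ],
}
now = datetime(2016, 5, 1, 10, 0, tzinfo=timezone.utc)
timings = next_arrivals(sample, now)
print(timings)
```

Returning `None` for a missing or unparseable timestamp, rather than raising, is one way to cope with the gaps this kind of live data tends to have; a caller can then decide whether to hide the service or show it as unavailable.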
So, in this short demo I'm going to show off a few things you can make using open data. One of them is a bus timing board. All of you have probably seen these boards at the more popular bus stops, telling you when the next bus of each service is going to arrive at the stop and whether it's disabled-friendly or not.

I'm also going to show you how to make a bus trip planner using the data that's available: given that I want to go from one bus stop to another, which bus stops do I have to go through along the way to transfer, and which bus services should I take to, for example, minimise the number of transfers, or minimise the number of bus stops, or some function of those.

And lastly, I'm going to show you how to make a taxi heat map. Who here has seen one of these visualisations that shows you where all the taxis are in Singapore? Has anyone seen these? Red means a lot of taxis: there are lots at Changi Airport, there are some in the CBD, and there are no taxis in Lim Chu Kang, so if you live there then too bad. I'm going to show you how to make one of these in the next hour or so using the open data that's available.

All the code for this presentation is available on this GitHub repo, so if any of you want to follow along, or later want to come back, revisit the code and try running it yourself, it's all up there in a form that you can easily dump into your Python interpreter and run.

To begin with, let's look at how to make a bus timing monitor like that. We won't have any hardware or any fancy UI, but I'll show you the core of getting the data out, manipulating it and presenting it in a reasonable way. This bus timing data is available at the LTA DataMall, on mytransport.sg; if you google "LTA DataMall" it will bring you here. This is a somewhat old-looking site that contains both static data, for example all the geospatial data I showed you
earlier, about where the roads are in Singapore and where the road crossings are, if you want to make your own model of the Singapore road network, as well as real-time dynamic data. The dynamic data is updated, say, once a minute or once every 5 minutes, and you have to request it through the API. You first have to request API access, but this is more a formality than an actual requirement: you just put in your name and email and they'll send you your access code. As far as I know, the main purpose is so that if you start doing weird things with the API or start overloading it, they'll know who to contact to work it out. So if you make something like 10 million requests a day to the API and the servers start falling over, they can phone you up and tell you to back off.

The documentation for the API, which is perhaps the most interesting part, is only available as a PDF document. I'm not sure why. But despite being only a PDF, it has everything you'd expect in a nice set of API docs. If you look through it, they have a changelog, which is more than most startups have in their API docs, and they have a nice listing of all the different APIs available.

For example, you can look up bus arrival times, and they have some examples. Let's see if we can make this bigger. If you make a request to the BusArrival endpoint under ltaodataservice on datamall.mytransport.sg, passing the bus stop number as a query parameter, what you get back shows you all the buses which are going to arrive at that stop, along with metadata about them. In this case it says service number 15 has an estimated arrival at this time and is currently at this latitude and longitude, in case you want to show it on some map or visualisation, and it shows you the subsequent bus that's going to arrive. It does this for all the different services that will arrive at that particular bus stop.

Similarly, LTA provides data for all sorts of random things. It has an API for traffic incidents, so you can see where the traffic incidents are in
Singapore if you want to put them on a map, and it has an API for road works. So if you have an app like Waze, for example, that's going to show, given your route, the things along the way that may affect you, these are the APIs that give you easy access to those things.

So that's enough talking about it; let's look at some code. I'll put this down a second, format this nicely, and make this a bit bigger so people can see it. Can you guys at the back see this? Okay, cool.

This is a standard IPython REPL, so if any of you want to follow along, it's IPython using Python 3. The first thing I'm going to do is load my LTA key, which I requested earlier. Then you can import requests and immediately start making requests to the LTA data service. For example, this pulls down the data for the traffic incidents which are ongoing in Singapore. You get a bunch of JSON which is slightly hard to read, and if you want you can explore it using the standard Python tools. It's a dictionary, I know that it has particular keys, and using those keys I can dig further into it, see what it is and get some interesting insights. For example, I can see the length of the list is now 22, so there are 22 ongoing traffic incidents: heavy traffic on the PIE towards Tuas, heavy traffic on Adam Road, and so on and so forth. So you only really need 4 or 5 lines of code to get started with this data, and from then on it's up to you what you want to do with it.

As mentioned earlier, we are going to make a bus timing monitor, so the API that interests us the most is probably the bus arrival API. But before we do bus arrival, one thing that would be nice is to get the bus stops. This is a request that will pull down the list of bus stops that are available in Singapore. It similarly has the data wrapped in a value key that you can pull out, and you see it only gives you 50 bus stops. That's because, as you
may have seen, there's a skip parameter that you can pass in that will let you get the next 50. Currently the last bus stop it showed is One Raffles Link along Nicoll Highway, and if I pass in a skip, I get Cathedral of the Good Shepherd along Victoria Street.

If I want to, I can write a small function that will fetch all this stuff for me, because it's not difficult to write a while loop that makes multiple requests until it has all the data you want. In this case we start with no output, and we fetch given a skip parameter which is the length of the output so far: if I already have 100 results, I skip the first 100 and ask for the next 50. We keep fetching until a fetch comes back with zero results, and that's when we know there are no more results and we can return the whole output as the return value.

So you can easily pull down, for example, all bus stops, though you still pull them down 50 at a time. For this particular data set it doesn't really matter how up to date it is, because the bus stops don't change every minute or every hour, so you probably want to pull it down once and cache it so you don't need to keep hitting their servers over and over.

And now we can see that we have all 4,859 bus stops. Let's look at the first one, for example. Each bus stop has a bus stop code, which you've probably seen on the bus stop pole; the description, Hotel Grand Pacific; the lat and long in Singapore; and the road name. This is helpful because many of the other APIs require the bus stop code.

For example, if you want to find the bus stops which are along Orchard Road, it's something like: stop for stop in all stops if the stop's road name equals Orchard Road. That's not right, let's try again. Oh, I misspelled Orchard. Orchard Road, there we go. And we can ask for just the description of each one. So these are the bus stops which are
along Orchard Road. And if you want a particular bus stop, for example the one whose description is Opp Orchard Stn, you can filter for that and, oops, now you can get its bus stop code. So that's how you can do basic things with the data you get from the LTA DataMall API, and this will come in useful later when we need the bus stop code in order to ask LTA for the data about a particular bus stop.

Now that we have the list of all bus stops, we can write a function to help us search for bus stops quite easily. I can find the stop which is Orchard Station, or the stop which is YMCA, for example, and there are multiple stops related to Orchard Station or related to YMCA.

After that, and let's make this bigger, it's a matter of making a separate API request to the bus arrival API, passing in the bus stop code as a parameter. For example, let's say I want the YMCA bus stop. Fetching this depends on a helper function I defined earlier, so let's pull that in so I don't need to keep writing requests.get with the headers. Then it's a matter of fetching the data for, in this case, the YMCA bus stop that we found earlier.

Again this is somewhat long, so we can shorten it. What's in this data? We can see it's a dictionary with 3 elements: it has the bus stop ID, some metadata which maybe we don't care about for now, and a services key. So now we can pull out the services and see how many there are: 29 of them. Let's assign this to a variable for the YMCA buses. You can look at one of the services to see what kind of data it gives us. In this case you can see that this is service number 106, and the next arrival time is this, in UTC, which we can convert to Singapore time in a moment. It shows you the subsequent arrival, and for the arrival after that it just doesn't give you data. This is a relatively common thing in these open data APIs, where some of the data is
either wrong or missing, and you just have to deal with it as a consumer.

Okay, so one thing that we could do is define a function that would prettify the UTC time into something more reasonable, for example giving us the time in minutes from now, which is the same thing they show on those big bus stop arrival boards. You could do this using a third-party library, for example pytz is popular and arrow is popular, but for now I'm going to avoid those libraries and just use the default Python plus some string mangling and regexes. You can paste this into the console.

Then you can define a function that, given a bus stop description, will find the bus stop which matches the description, pull down the data for you using the bus arrival endpoint, passing in the bus stop code, and give you the service number and the next bus arrival time. For now let's just ignore the subsequent buses, the next-next bus and the next-next-next bus.

So I can ask for bus timings for, let's say, Opp Orchard Stn, and you can see that we get 106, 101, 123, 140 and 16, all with the hard-to-read UTC timings. We can feed it through a simple for comprehension to prettify all the timings, using the function we defined earlier that converts them into a relative time. You feed it in and you get out a nice list of all the next bus arrival times at Opp Orchard Stn, which is along Orchard Boulevard in Orchard, Singapore.

It looks mostly correct. Some data is missing because the NightRider buses aren't running right now, and these other buses maybe are not running at this time either. Some data looks a bit weird: 16 seems to be arriving a very long time from now. So again, some data is going to be broken, and maybe you ignore it when you're displaying it to your users or when you're trying to do your computations. But other than that
you can see 65 is arriving right now, for 106 you'll have to wait another 11 min, for 123 you'll have to wait another 13 min, and so on. We can run this on other stops, for example the YMCA stop, which has a lot more buses: buses which are running, buses which are not running, and buses which are going to arrive a long, long time from now.

And that's all you really need to get a simple bus arrival time monitor using the APIs that LTA provides. This doesn't show you how to put it on a pretty website, or on a hardware LED display, or in a mobile app, but the core of what you need is on the order of a few dozen lines of code, mostly to perform the fetch against the bus arrival endpoint in order to get your data down. After that it's trivial to mangle it in Python and display it however you wish. So that's the first demo: making a bus arrival time display. It's in a terminal and not on a board at the bus stop, but functionally you can kind of say it's the same, more or less.

The next demo is making a bus trip planner. We'll use the same data source for this as we did for the previous one, and the goal is to say: given that I'm at bus stop 1 and want to get to bus stop 2, what bus stops do I have to stop at along the way, and what bus services do I have to transfer to at those bus stops, in order to get to my destination?

The data for this is available in a different endpoint called bus routes. You basically have to make a request to this BusRoutes API, which I can show you if I fetch it. You don't need to pass in a bus stop ID for this endpoint. Let's call the result routes and see what it is. Again, routes is a dictionary, and the interesting things seem to be in this value parameter. Inside
value we have a list of 50 things that look like this. Each item says that at this bus stop, in this case 14141, a particular SMRT bus service stops, and where that stop falls in the service's stop sequence. So it's not a lot of information; it just tells you about a single stop. If you look at element 1, element 2, element 3, you can see every single element just says that a single bus stop is stopped at by a single bus service, across the whole of Singapore.

Although it's not in a convenient format, we can aggregate this data, and we can fetch all of it rather than having just 50 elements. So we fetch the bus routes; let's call this all route stops. You can see that it has to make a lot of fetches, because this data set includes one item for every bus service stopping at every bus stop, so let's give it another 20 seconds or so to pull it all down. Again, this is something that you probably don't need to refresh often, because bus routes don't change very often. If you want to use it, pulling it down once a day or once a week is probably fine. If I remember correctly, it's something like 20,000 data points.

This is how you end up making 5 million requests a day to the API and getting a call telling you to back off. If you keep doing this, they'll find you and tell you to stop. They told me to try and keep it below 2 or 3 million a day, but there isn't any hard limit. Some services will rate limit you automatically; these ones don't, and I think it's kind of manual: someone will notice the numbers going up and they'll find out who is doing it. I don't think the 3 million a day was really a hard limit, but if everyone starts using 3 million they'll probably also start getting unhappy.

Okay, so we are done. That only took 4,400 network requests, and you can see all route stops now has a huge list of every bus stopping at every bus stop in Singapore. Again, this is not a very useful data structure, but it contains all the data we need in order to find out which buses can take me from one bus stop to another
bus stop. So let's go back here. I wrote some code that I won't explain in too much detail due to time. The first snippet converts this format, of every bus service at every stop, into a dictionary keyed by route. Let's see, let's pretty-print the keys of all routes. It's also not that useful to look at, but basically each key to this dictionary is a tuple of the bus service number or ID, for example 970 or NR2 or something else, together with the bus direction, because some buses go in 2 directions, and the ones going in the so-called backwards direction are represented by a 2 in the direction number. So presumably there's a bus 38 going outwards from the terminal, and there's a bus (38, 2) which is the 38 coming backwards, and those are 2 different routes. For each of these routes we can now ask, for example, given bus 38 in the backwards direction, which bus stops are along that route.

So this is a slightly more useful data structure, but what I need for the path-finding algorithm is, from a given bus stop, which routes can get me to which other bus stops. So the next step is to create a dictionary which tells me which bus stops can take me to which other bus stops directly. We have which routes have which bus stops; now I want to know which bus stops connect to which other bus stops. Let's build that as all stop connections. Come on. So if you look at, for example, stop number 49071, this data structure tells me there's only one bus coming out of stop number 49071, and that's bus number 925, and bus number 925 will bring me to the following bus stops.

Now that we have a data structure that lets us look up how to get directly from any one bus stop to any other bus stop, we can use it to calculate how to get indirectly from any bus stop to any other, which may require 1 or 2 or 3 transfers. If we only care about the number of transfers and don't care about other things like distance, or other weightings like
number of stops, this can be done simply using a breadth-first search. I have a breadth-first search here, which people may have learnt in school. Who here has written a breadth-first search before? So some fraction of you have. It's basically your intro computer science algorithm problem. You have a queue and you put in your starting ID. Then you have a while loop that runs until the queue is empty: each time, you pull out the thing at the front of the queue, see what it can connect to, and put all of those back in the queue. This lets you gradually expand the set of bus stops you have seen until you hit the one you want, in which case you stop and work out how you got there. Again, I'm not going to explain the code in full due to time, but it's not a lot of code and you can browse it on GitHub as I mentioned earlier.

So now we've defined this function. Let's say you want to look for the bus route from Opp Orchard Stn, along Orchard Boulevard, to Anglo-Chinese School, which is where I used to study. Opp Orchard Stn has bus stop code 09022 and Anglo-Chinese School has bus stop code 40069, so we can feed those into our breadth-first search, and you can see that it tells you, starting at Opp Orchard Stn, to take bus number 174 to Tulip Garden and after that take bus number 48 to Anglo-Chinese School.

If we don't trust this, because there's no reason to believe it's actually correct since I just put it together here, we can verify it manually. If you look up the LTA bus routes, what is it, 174, is this the one? Bus 174, okay, this should be it. So this is Opp Orchard Stn, 09022, and that takes you to Tulip Garden down here. And if you then look at bus route number 48, you can see it takes you from Tulip Garden up here to Anglo-Chinese School down here. So this is a bit of a manual verification; you can do more
fancy things if you want, but this shows that this is a valid route to get from Opp Orchard Stn to Anglo-Chinese School.

Again, this is a relatively naive algorithm. Because it does a breadth-first search, it doesn't take into account the number of bus stops between one transfer and the next, and it doesn't take into account the distance between the bus stops, both of which are actually available in the LTA API. It only finds a route that minimises the number of transfers, and that route may not be ideal. But if you want to write this yourself, you could spend another 30 minutes or an hour converting it into Dijkstra's algorithm so you can take these extra features into account, and then weight it however you want: for example, finding a route that minimises the number of bus stops along the way plus the number of transfers times some multiple, or a route that minimises the distance travelled regardless of how many bus stops or transfers it takes.

Another example: if you want to find the route from, let's say, Boon Lay CC, which is 21439, to some Changi Airport stop like 95011, we can plug those in, 21439 and 95011, and you see it gives us a route from Boon Lay CC to Boon Lay Interchange, then on via Linsen Road to Changi Airport. If you want, you can go verify this yourself on the bus website; I won't do that right now.

Cool, so that is the second demo: making a bus trip planner using the open data that's available in the LTA API. It's not quite as pretty as the thing you see in Google Maps, but functionally somewhat similar.

The last demo I'll show off is making one of these fancy taxi density visualisations, which let you see where all the taxis in Singapore are: there's a whole bunch at Changi, a few at Orchard, a few at Geylang and so on. I'm going to use the Data.gov.sg API instead of the LTA API for this, but in practice it's about the same data that you can get from both sources. The Data.gov.sg API looks like this. You also have to register to get an
API key, separate from the LTA API key and separate from any other API key you sign up for with other government agencies. And this gives us, let's say, a single API request to taxi availability, which, though it doesn't say so here, will give you the lat-long locations of all the taxis in Singapore. I can show it to you in a moment.

So if you go back here, let's go to taxi.py, which has my demo for this. The first thing is to pull up the Data.gov.sg API key. You need to pass in a datetime, because they let you query historical data as well, so you can pass in an arbitrary datetime for when you want to query; this one is a few hours ago. And you need to pass in the API key. You run that, and you see taxis gets you a huge pile of lats and longs. This doesn't have as much metadata as the buses: it doesn't tell you which taxi is driven by which person or what its licence plate number is, but it gives you enough to draw the fancy map we showed earlier.

You can pull out the data relatively easily. I believe it's under features. Taxis, features; I guess there's only one feature. Then I look at its keys, I get the geometry, and inside that the coordinates, and now you get a list of, basically, tuples of the lat-long of every taxi in Singapore.

If you want to make a visualisation like the one we have here, the first thing to do is probably to find the bounding box: the minimum and maximum latitude and the minimum and maximum longitude of the rectangle of the world that we care about, the one that has these taxis inside. Since we know the data is a list of tuples, it's a matter of two for comprehensions to pull out the max and the min, or a single for loop to pull out the max and min of both x and y. So let's do that. This loop over everything pulls out the max x, min x, max y and min y, and from there we can put the taxis into buckets that we can then put onto a map. Let's assume I want a map that's
40 buckets wide; then how high would that map be? Since we know the max and min x and y, it's a matter of doing a bit of arithmetic to find the ratio between max x minus min x and max y minus min y, and using that to calculate how many buckets high our map should be. Given a width of 40, it says the map should be 23 buckets tall.

We can initialise a grid to hold all these buckets: 24 rows of buckets, and each row has 41 buckets in it, presumably one extra in each direction so that the taxis at the very maximum still land in a bucket. Then we can take each taxi data point and chuck it into a bucket using its x and y. We take the difference between its x and the min x, and between its y and the min y, and work out which bucket the taxi should go in. So instead of having a coordinate running from, say, 1.03 to 1.05, you get a bucket number from 0 to 23, for example.

And now you get what I call the grid. It's still not very interesting, at least not like this. What if I print it out nicely and give myself a bit more space? Yeah, you can kind of see that it's a map of Singapore. You can see the whole bunch of zeros in the central catchment area. You can see there are like 2 taxis in Lim Chu Kang; not sure who they're picking up, hopefully not ghosts. And you can see the big bunch of taxis at Changi Airport, where one cell has 410 taxis.

This is itself not very pretty, but you can pretty it up without too much difficulty. The first thing, I guess, is to remove the zeros. So for n in row, join, what's this doing, right-justify, for n in row; I'm just going to print a blank if n equals 0, else the number. That gets rid of all the zeros, and now it's quite clear what it is. You can quite clearly see all the outlines: this part is Tuas, this part is Jurong Island, and so on.

The next step would be to try and add some colour, because one of the reasons this kind of visualisation is so easy to read is that you can immediately see where the red spots, the blue spots and the green spots are. To do that, I define a small function here that uses ANSI colours to colour the number of taxis in each cell. So, simple colour. Sorry, how
I do this, let me think a bit. In order to colour the cells reasonably, we want to colour them based on their percentile rather than based on their value. That's because of the biggest cell, the one with around 400 taxis: even half of 400 is still more than any other cell on the map, so if you colour by value, for example 10% of max, 20% of max, 30% of max, you won't get any reasonable colours out of it. So this is a small snippet of code that will tell us the percentile of each cell. If I build the rankings and ask it what the percentile is for a cell with 37 taxis, it tells us it's the 95th percentile, more or less.

And now, using a slightly more complex for loop... w is not defined; I guess I need to move this out. No, that's not right. Okay, let's move these out and put them outside. And if I run this, you get a slightly more colourful rough map of what the taxis in Singapore look like. You see the deep blues where there are almost no taxis, the blacks where there are no taxis at all, and the reds where there are high densities.

The last part of the demo is to use slightly more fine-grained colouring. Bash lets you do 8 colours like this, or 16 colours, and most modern terminals also let you do 256 and 65k colours using different escape codes. These 8 colours were done using escape codes like \033[31m, which you can find if you google "ANSI escape codes". I defined a function that uses the 216-colour escape codes, which is a bit complicated to explain here, but if we run it you'll see we get a much more colourful map of where things are. You can see quite clearly that the CBD and Changi are the most popular places for taxis to hang out, and you can see the gradient drop off: the greens and yellows have fewer and fewer taxis, the oranges have even fewer, and eventually you end up with these dark blues and purples which have barely any taxis at all.

So that's the last demo for this presentation. So these are the 3 things we built.
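The bucketing and percentile colouring from the taxi demo can be sketched roughly as follows. This is not the code from the talk's repo: the function names, the 5-step colour ramp and the sample coordinates are illustrative assumptions, points are treated as (longitude, latitude) pairs, and it sticks to the basic ANSI foreground colours rather than the 216-colour codes.

```python
from bisect import bisect_left

# Basic ANSI foreground colours, sparse to dense: blue, cyan, green, yellow, red.
RAMP = [34, 36, 32, 33, 31]


def build_grid(points, width):
    """Bucket (x, y) points into a grid `width` buckets wide, keeping the aspect ratio."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    span_x = (max_x - min_x) or 1.0
    span_y = (max_y - min_y) or 1.0
    height = max(1, round(width * span_y / span_x))
    # One extra row and column so points at the maximum still land in a bucket.
    grid = [[0] * (width + 1) for _ in range(height + 1)]
    for x, y in points:
        col = int((x - min_x) / span_x * width)
        row = int((max_y - y) / span_y * height)  # flip so north is the top row
        grid[row][col] += 1
    return grid


def percentile_fn(grid):
    """Return a function giving the fraction of non-empty cells strictly below a count."""
    counts = sorted(n for row in grid for n in row if n > 0)

    def percentile(n):
        return bisect_left(counts, n) / len(counts) if counts else 0.0

    return percentile


def render(grid):
    """Render the grid as text, blanking empty cells and colouring the rest by percentile."""
    percentile = percentile_fn(grid)
    lines = []
    for row in grid:
        cells = []
        for n in row:
            if n == 0:
                cells.append("   ")
            else:
                colour = RAMP[int(percentile(n) * len(RAMP))]
                cells.append(f"\033[{colour}m{n:>3}\033[0m")
        lines.append("".join(cells))
    return "\n".join(lines)


# Made-up taxi positions, not real data.
taxis = [(103.5, 1.25), (103.75, 1.25), (103.75, 1.375),
         (104.0, 1.5), (104.0, 1.5), (104.0, 1.5)]
grid = build_grid(taxis, width=4)
print(render(grid))
```

Ranking by percentile means one very dense cell does not push every other cell into the lowest colour band, which is the problem with colouring by a fraction of the maximum.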
These three demos don't look quite as pretty as the real things, but they do the same job: they pull out the data, they let you visualise it, and they get you the same information in a slightly different format.

So yeah, those are some of the open data APIs available. There are some limitations, as you've seen already. It's messy to have to dig through different PDF documents to find out how to use the API endpoints, and to sign up for 3 different API keys with 3 different government agencies. The accuracy of the data is also sometimes questionable: bus number 7 probably wasn't going to take 1,400 minutes to arrive, but these glitches in the data do appear and you have to handle them. And similarly, the availability of the data is not always what you want: sometimes the data you want just isn't publicly available in the API, and that's just too bad.

But on the bright side, we could make 3 quite neat demos of pretty useful tools that people use day to day, using open data APIs, in 40 minutes and 100 to 200 lines of code. If any of you want to go and look at the code again, it's all on GitHub: if you go to github.com/lihaoyi/opendata you'll see there's a bus demo with all the code you can copy and paste, and there's a taxi demo, also with all the code you can copy and paste into your own IPython console, so you can try it yourself.

And I guess that's it. So that's the presentation, Exploring Singapore's Open Data APIs. Then a brief pitch for Bright Technology Services: if any of you have data science work or analysis that needs to happen, or software engineering work that needs to be done, we can pick up projects, or we can talk to you and see if it's something we can handle. My email's here if any of you want to contact me later, and the code is there if any of you want to try this yourself, modify it or build upon it. Do we have time for questions?
Going once, going twice... go for it.

When I look at the code there, actually they can calculate the price, but if the idea doesn't match it doesn't need to happen. So how do you think...?

I don't know. I assume it's a simple enough model that you can figure it out once, hard-code it, and not have to change it too often. But I don't actually know what they did. For something like the bus pricing, if it only changes once every few months, it's something that you can figure out each time and just code in. I don't actually know if that's what they do.

Any other questions? Cool, I guess we are done. Thank you.
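The breadth-first search behind the trip planner was skipped over in the talk for time, so here is a minimal sketch of the idea as described: a queue of stops, expanded through direct connections until the destination turns up, then walked backwards to recover the route. The `connections` structure, the function name and the toy data are assumptions for illustration (only the 174 and 48 legs come from the talk's example; "Stop A" is made up), not the code from the repo.

```python
from collections import deque


def find_route(connections, start, goal):
    """Breadth-first search for a route using the fewest bus rides.

    `connections` maps stop -> {service: [stops reachable directly on that service]}.
    Returns a list of (service, alighting_stop) legs, or None if there is no route.
    """
    came_from = {start: None}  # stop -> (previous stop, service taken to get here)
    queue = deque([start])
    while queue:
        stop = queue.popleft()
        if stop == goal:
            break
        for service, reachable in connections.get(stop, {}).items():
            for nxt in reachable:
                if nxt not in came_from:
                    came_from[nxt] = (stop, service)
                    queue.append(nxt)
    if goal not in came_from:
        return None
    # Walk backwards from the goal to recover the legs in order.
    legs = []
    stop = goal
    while came_from[stop] is not None:
        prev, service = came_from[stop]
        legs.append((service, stop))
        stop = prev
    return legs[::-1]


# Toy fragment of the stop-connection dictionary; not real route data.
connections = {
    "Opp Orchard Stn": {"174": ["Stop A", "Tulip Gdn"]},
    "Stop A": {"174": ["Tulip Gdn"]},
    "Tulip Gdn": {"48": ["Anglo-Chinese Sch"]},
}
route = find_route(connections, "Opp Orchard Stn", "Anglo-Chinese Sch")
print(route)
```

Because every stop is linked directly to all the stops further along each service, each step of the search is one bus ride, so the first route found has the fewest rides and therefore the fewest transfers. Taking distance or stop counts into account would mean swapping the plain queue for a priority queue, along the lines of the Dijkstra's algorithm variant mentioned in the talk.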