Well, good afternoon. My name is Julin, and I will be talking about an open source project that I have managed for the past couple of years. This project came out of DataKind ASEAN. In April 2016, the different DataKind chapters in the ASEAN region got together, and we decided to see what we could do as a region and not just as individual chapters.
If you have never heard of DataKind, it is a non-profit organisation that helps other non-profit organisations. The volunteers are data scientists in their day jobs, and they volunteer their time and their expertise. They look over a non-profit organisation's data, help them organise it, see what can be done with it, and even run a very professional data analysis and present it back to the organisation. That is what they do. In 2016 they decided, let's see what we can do together and not just as individual chapters, and Healthcare ASEAN came out of it.
The idea is that diseases don't keep to their own country boundaries, right? So we thought, why not centralise all this open data and see what we can do with it, because in most countries healthcare organisations don't share data with other people. That is a gap, but nowadays some countries are opening up: the US and some European countries have decided that it is actually more beneficial to share healthcare data. For us in Asia, healthcare data is still quite closed, but we still try to see what sort of open data we can find. We decided to start with dengue and malaria because they are a focus of this region, right? These two diseases are quite common.
So how do we do that? First we need to formalise the idea: make sure that this is the problem we have, and this is what we are going to do. So what did we come up with?
Impact statement: we want to use open data to generate an open platform for descriptive and exploratory analysis of malaria and dengue, so that individuals and/or non-profit organisations will benefit from it and we help reduce the risk of these diseases locally. If we don't put the data in a central place, we won't be able to see the movements of these diseases in the region. You are bounded by your region, your particular country or even your particular state, which is quite small, right?
Next, what do we do? Of course we have a plan, right? Whatever the project is, make sure you have a plan. Always have a plan, and this is what we did: we came up with a plan. The first thing we do is research the data. What does that mean? We find out everything we can in terms of open data: what sort of data we will be facing, what sort of data we can collect. Different countries have different languages, and we are very, very lucky to have people who really support us and who know the different languages in the region. So we have DataKind Vietnam, Hanoi I think it was, I can't remember. I'm not sure which chapter, but somebody from Vietnam found out for us what sort of data is available in Vietnam, and this is good because nobody else knows Vietnamese, right?
So we make sure we document what data we have first, to see what we can work with. From there we collate it, and we plan how we can use it, what sort of data we will be looking at, and at what sort of granularity. And of course, with malaria and dengue, straight away we realised that we can't get away without looking at the weather data as well. So these three things are key: the geography, the disease cases and the weather.
Another thing we realised from our research is that for diseases, we look at the data at a weekly granularity. Nothing else would work. Daily definitely doesn't work, because a case is reported when you already have a fever, and that is not when you actually got infected. So we usually look at it on a weekly basis. From there, we decided that wherever we can, we go and find weekly data, which is not available for all countries. For some countries we only get yearly data. For some countries we have none. For some countries we still manage to get monthly data, which is better than yearly. And Singapore obviously has the cleanest dataset, not the longest period, but it is the cleanest dataset that we found. So we stuck with Singapore as our base.
So how do we plan our project? We have the team. Obviously, we try to find the right number of people for each section: the analysis, the research methods, the know-how for the different countries, knowledge of the different languages, the different resources. We list it all out so that everybody can see where to get the different resources, the links, and of course the different Trello boards. Over two years we ended up with three Trello boards on top of the GitHub project page. Whatever is in progress, the different tasks, that is a separate thing.
Of course, every single step of the planning is documented. Documentation is very, very important. The first thing everybody looks at is the problem statement and the goal of the project. Then we structure everything very cleanly so that people know where to look for information, where they should put their source code, where they should put their documentation, where they can find the data at the different granularities, the project setup for new beginners, and a clean file structure for our Google Drive as well. All these links are of course inside the resources.
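To illustrate the weekly granularity described earlier in this part of the talk, here is a minimal sketch of grouping individual case reports into ISO weeks. It is not the project's own code: the function name `weekly_case_counts` and the sample dates are made up for illustration, and it assumes each case arrives with a report date.

```python
from collections import Counter
from datetime import date


def weekly_case_counts(report_dates):
    """Count reported cases per ISO (year, week) pair."""
    counts = Counter()
    for d in report_dates:
        # isocalendar() returns (ISO year, ISO week number, ISO weekday).
        iso_year, iso_week, _ = d.isocalendar()
        counts[(iso_year, iso_week)] += 1
    # Return a plain dict ordered by year, then week.
    return dict(sorted(counts.items()))


# Made-up report dates, for illustration only.
reports = [date(2016, 1, 4), date(2016, 1, 6), date(2016, 1, 10), date(2016, 1, 11)]
print(weekly_case_counts(reports))
```

Grouping by ISO week keeps week boundaries consistent across years, which matters when the counts are later compared with weekly weather data.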
So everything is documented, everything is online, and nothing is kept hidden, so that we can find information wherever we are. Those are really the keys for most open data or open source projects: documentation, and making sure everything is out there online where you can find it.
Breaking down the tasks is one of the key things that we find really, really useful. Make sure the tasks are as small as possible and as easy to complete as possible. Most of the people who are willing to help want to get something out of it. They don't know Python, for example, and they want to learn Python. Give them an easy task: they get to learn Python, and we get the job done. That is really helpful. Basically, it's a quid pro quo. People usually want to get something out of it, and they want to feel good about completing a task.
So we break it down. The automated data collection for every single country, every single site, is a separate task. The cleaning is a separate task. The visualisation is a separate task. We put them on different cards, whether a task is to do, whether it is blocked, whether there are any issues, all of it documented on the Trello board. And last year GitHub happily gave us a project board. Before last year, GitHub never had a project board. Now it does, and it really helped a lot. So we transferred all the tasks that we used to have on the Trello board onto GitHub.
So we have all the small tasks. The Weather Underground API, this is a bug. Refactor the download script. As small as possible, so that people can complete them as easily as possible. All the in-progress tasks, and all the completed tasks that we still need to create unit tests for. And there are a couple of others as well. The documentation is here. Documentation also needs specific tasks, as small as possible, so that it gets done.
That really helps people find out what needs to be done, what has been done already, who is doing what and what is in progress. These are important keys. And with all of this information online, people can find it easily.
We also need to set up a framework. For any project, if there is no framework, it gets out of hand very quickly. So this year we really made sure that we follow a very standard architecture. Right here we have the docs, the models, notebooks, reference, reports, source and test folders. Obviously the key ones for most people will be the source folder and the test folder. Nobody has tried the docs folder yet, which is unfortunate. Ideally, whatever you write in the source folder will be reflected in the docs folder as well as the test folder.
And of course, the README. The README is key. It gives you the architecture and the requirements. We have our project organisation and a description of each folder, and what the intention for each folder is, all documented, so people know what is going on. They can jump in at any time, they can come back at any time, and they will still know where the project is at.
And this here is the one thing that I made the mistake of not doing from the start. We implemented the testing, the unit test folder, quite late, only a couple of months back I think. That was almost one and a half years into the project, and that was my biggest mistake. I only implemented it when the project had gotten out of hand, when there were bugs coming out that I, as the owner of the project, could no longer handle. So yes, tests should actually be the first thing you do. It is easiest to do when the project is still small, and easiest to maintain from then on. Once the project starts getting out of hand, it is already somewhat too late. Right? So that was my biggest mistake. But oh well, never mind. We can still turn it around. We can still carry on.
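As an illustration of the kind of small unit test that could have been written from day one, here is a minimal sketch. It is an assumption for illustration, not code from the project: `parse_case_count` is a hypothetical cleaning helper, and the tests are plain `test_*` functions with bare asserts, the style a runner such as pytest collects.

```python
def parse_case_count(raw):
    """Hypothetical cleaning helper: turn a scraped cell such as ' 1,204 ' into an int.

    Returns None for blank or non-numeric cells.
    """
    text = raw.strip().replace(",", "")
    if not text.isdigit():
        return None
    return int(text)


def test_plain_number():
    assert parse_case_count("42") == 42


def test_thousands_separator_and_spaces():
    assert parse_case_count(" 1,204 ") == 1204


def test_missing_value():
    # Scraped tables often use '-' or an empty cell for "no data".
    assert parse_case_count("-") is None
    assert parse_case_count("") is None
```

Each test covers one behaviour of one small function, which matches the "as small as possible" approach to tasks: a newcomer can add a single test case without understanding the whole codebase.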
So I will learn from my mistake, and hopefully we can still clean up all the bugs. Testing and logs are quite important. But again, you have to prepare the structure for them. So we created the folder and have the tasks ready, so that people can pick up a task and have a place to put their test cases.
Steady progress is one of those things that a lot of projects struggle with. We have seen a lot of dead projects. So as long as this project does not become a dead project, I am actually very happy. Luckily for me, I found a platform to get new blood into the project. We have regular events. Last year we had a weekly social coding session where I introduced this project. I usually give the members a selection of a couple of open source projects, and the majority, I would say more than 90%, chose to work on the healthcare data project. Well, mainly because it is Python, which is one of the buzz programming languages nowadays, so everybody wants to learn Python. So they find Healthcare ASEAN, and that is how I get new blood working on this project, which is great. For me, getting the project to the point where it has a lot of bugs means that it is growing, you know, it has grown to that stage.
So for any project, and especially for open source projects, you need to really, really persevere, right? That is the key. It doesn't matter if there are bugs; at the end of the day we will reach our goals. We will have the data, and we will have this. I think this one is dengue fever for Singapore. No, this one is dengue fever for Malaysia.
You can see there is always a spike regardless of the year. For 2014 there is a larger population, you know, there is population growth, so you can very easily see growth, and in 2010 there is a bigger trend. There is always a spike at roughly the same time of year because of the monsoon season. I don't have the weather data superimposed, but the monsoon season is usually around here, and then we have the dengue fever after.
And this is malaria for Singapore. Malaria actually follows a very different trend from dengue fever in Singapore. We found that if you plot it on the map, malaria tends to happen around ponds and reservoir areas. Dengue fever is usually found in areas with a high population: lots of HDB flats, and you get high dengue fever case counts. They also follow slightly different weather trends. And that's it. Does any of you have questions about the project, or is anybody interested in contributing?
Q: What are the types of people you hope would use this data? Public officials, or healthcare organisations?
A: Healthcare organisations would be our key. If you see here, we identified that healthcare organisations would be the ones that benefit most from our project. So individuals and/or non-profit organisations. Individuals can't do much with it, but they are the ones that supply us with the data. In countries where the government doesn't open up its weekly disease cases, we find that sometimes the public are the ones who have a stake in the issue, right? To them it is key. They collect this data, we scrape it from them and put it together, and then hopefully it helps healthcare organisations. So for now we depend on the public to get usable data. If you stay in Singapore there is no issue, because the government publicises the data very clearly. But take Malaysia: the high-population areas are the ones that have most of these diseases. Say you stay in a taman in Kuala Lumpur: the statistics are reported in the newspaper. These people are pretty good; they collect this information and put it on a portal. Unfortunately, the problem is that they don't store the data. In Singapore too, there are a couple of people who really diligently record all this data from NEA and report the dengue clusters in Singapore, for example.
Q: So the first step when working on this project is to run the scripts that automatically download the data? So data scraping?
A: Yes.
Q: Why is the data not already available on GitHub?
A: One of the things about storing the data on GitHub is that we want to make sure this is reproducible code, and that this is open data that we can get off the internet. If the site is down, it is no longer considered open. It is also useful, and I do see the pros of storing the data: in case a particular site goes down, at least you have that data already available. We have another folder for that. We don't always keep track of it, but for the initial research we have this data stored on Google Drive. So yes, we do have a folder where we store the data, but on GitHub it is automatically downloaded. We use real-time data scraping, and anybody who comes into the project brand new can just run one script and it will start scraping all the data into the data folder.
Q: And how big is that data?
A: No, not big. A couple of megs. The biggest problem is actually finding the necessary data. Even for Singapore, where we have weekly data, it is less than a meg; it's a couple of kilobytes, that's it.
Q: But you need to download it straight from the source?
A: Yes. That is one of the other keys of this project: all the code needs to be reproducible by somebody else at home.
Q: So the script would break if the data sources changed?
A: Yes, and we have seen that happen. Weather Underground recently changed its API, so it broke some of our code and it became a bug. It does happen. But that's the thing, especially with open source: you have to keep up with the internet. Sorry, you have a question?
Q: I'm curious about the weekly data you mentioned, that people are able to share the data with you. I work in marketing for a company that specialises in healthcare and pharmaceutical products. I'm not sure, but the project covers some of the major countries, and in Vietnam we have a privacy law in healthcare that is really strict, and my company has had some problems with it.
A: No, it doesn't infringe on the privacy law at all, because these are just numbers, the case numbers. There are no names, no individuals, nothing. It is just the number of cases in a particular district. For Vietnam, one of our Vietnamese DataKind members gave us the data, which was really good. And another member, from Thailand, is the one that found the data from Thailand for us, because again it is in Thai, and non-Thai speakers find it hard to go through all the Thai sites even with Google Translate. For Vietnam, for example, the minimum that we get is from the WHO, the World Health Organization, which is yearly data, but we still take the data even though it is yearly. For some countries we have zero data, Cambodia and also Myanmar, but there is nothing we can do about that.
Q: Sir, I've found that there are a number of...
Q: I apologise if you said this already, but how long have you been working on the project?
A: Since 2016. We did our research from April to August or September 2016, and only around August did we really start on the coding, the data scraping side.
Q: Have any government agencies already consumed data coming out of the system?
A: No. Governments only care about their own countries; they don't really care what happens in other countries. Not so far. We haven't reached that stage yet, unfortunately, but that is our goal, that it can be beneficial to somebody.
Q: You mentioned that you have a lot of data scientists participating. Is anyone from healthcare helping to provide some insight?
A: Occasionally, yes. Occasionally we have somebody from healthcare who looks through our data, our scripts and our methods. To be honest, they weren't 100% happy with how it is done, but they didn't say no altogether either. They were sitting on the fence; they haven't decided whether or not it is going to work. But these are things that you sometimes won't know until it's done.
Q: Is there a good place for people who are interested in getting involved? The Trello idea is really good for planning everything. Is there guidance on how someone might get started and what they should do?
A: Yes, we have it all on GitHub. Everything is open. If you are interested, the meetup page actually has slightly more information; it will give you links to the Trello boards, which the GitHub page doesn't. But the GitHub repository itself has everything you need. The Trello boards are outdated, and we have now moved to the GitHub project board. We also have a Slack account, but you have to email datakind.sg for permission to join it. Thank you very much. Cheers.
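The "run one script and it downloads everything" idea from the Q&A, together with the point that a changed site or API breaks a scraper, can be sketched as follows. This is an assumption for illustration, not the project's actual download script: `download_all`, the source names and the two stand-in fetch functions are hypothetical, and the fetchers return canned text instead of contacting any real site.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("download_all")


def download_all(sources, data_dir):
    """Run every downloader and save its output under data_dir.

    `sources` maps a source name to a callable that returns the raw text.
    Returns {name: None} on success or {name: error message} on failure.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, fetch in sources.items():
        try:
            text = fetch()
            (data_dir / f"{name}.csv").write_text(text, encoding="utf-8")
            results[name] = None
        except Exception as exc:
            # One broken source should not stop the remaining downloads.
            logger.warning("download failed for %s: %s", name, exc)
            results[name] = str(exc)
    return results


def fake_singapore_weekly():
    # Stand-in for a real scraper; the row below is made up.
    return "week,cases\n2016-W01,3\n"


def broken_weather_api():
    # Simulates a source whose response format has changed.
    raise RuntimeError("API response format changed")


with tempfile.TemporaryDirectory() as tmp:
    outcome = download_all(
        {"singapore_dengue": fake_singapore_weekly, "weather": broken_weather_api},
        tmp,
    )
    saved = sorted(p.name for p in Path(tmp).iterdir())

print(outcome)
print(saved)
```

Recording a per-source result instead of stopping at the first error means a newcomer still gets most of the data, and the failed source shows up as a clearly scoped bug to file on the project board.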